DAN  WOODS 
JACK  PARK 


THINKALONG  SOFTWARE,  INCORPORATED 
16740  WILLOW  GLEN  ROAD 
BROWNSVILLE  CA  95919 


JANUARY  1994 

FINAL  REPORT  FOR  12/01/91-12/01/93 


APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  IS  UNLIMITED. 


DTIC 

ELECTE 
FEB 28 1994




MATERIALS  DIRECTORATE 

WRIGHT  LABORATORY 

AIR  FORCE  MATERIEL  COMMAND 

WRIGHT PATTERSON AFB OH 45433-7734


94-06399 




NOTICE 


WHEN  GOVERNMENT  DRAWINGS,  SPECIFICATIONS,  OR  OTHER 
DATA  ARE  USED  FOR  ANY  PURPOSE  OTHER  THAN  IN  CONNECTION 
WITH  A  DEFINITELY  GOVERNMENT-RELATED  PROCUREMENT,  THE 
UNITED  STATES  GOVERNMENT  INCURS  NO  RESPONSIBILITY  OR 
ANY  OBLIGATION  WHATSOEVER.  THE  FACT  THAT  THE  GOVERNMENT 
MAY  HAVE  FORMULATED  OR  IN  ANY  WAY  SUPPLIED  THE  SAID 
DRAWINGS,  SPECIFICATIONS,  OR  OTHER  DATA,  IS  NOT  TO  BE 
REGARDED  BY  IMPLICATION,  OR  OTHERWISE  IN  ANY  MANNER 
CONSTRUED, AS LICENSING THE HOLDER, OR ANY OTHER PERSON
OR  CORPORATION;  OR  AS  CONVEYING  ANY  RIGHTS  OR 
PERMISSION  TO  MANUFACTURE,  USE,  OR  SELL  ANY  PATENTED 
INVENTION  THAT  MAY  IN  ANY  WAY  BE  RELATED  THERETO. 


THIS REPORT IS RELEASABLE TO THE NATIONAL TECHNICAL INFORMATION SERVICE
(NTIS). AT NTIS IT WILL BE AVAILABLE TO THE GENERAL PUBLIC, INCLUDING FOREIGN NATIONS.


THIS  TECHNICAL  REPORT  HAS  BEEN  REVIEWED  AND  IS  APPROVED  FOR  PUBLICATION. 




STEVEN  R.  LECLAIR 

Technical  Director,  Manufacturing  Research 
Integration and Operations Division
Materials  Directorate 


WALTER M. GRIFFITH
Branch  Chief,  Manufacturing  Research 
Integration  and  Operations  Division 
Materials  Directorate 


ROBERT  L.  RAPSON 

Chief,  Integration  and  Operations  Division 
Materials  Directorate 


IF  YOUR  ADDRESS  HAS  CHANGED,  IF  YOU  WISH  TO  BE  REMOVED  FROM  OUR 
MAILING  LIST,  OR  IF  THE  ADDRESSEE  IS  NO  LONGER  EMPLOYED  BY  YOUR 
ORGANIZATION PLEASE NOTIFY WL/MLIM, WRIGHT-PATTERSON AFB,
OH 45433-7740 TO HELP MAINTAIN A CURRENT MAILING LIST.

COPIES  OF  THIS  REPORT  SHOULD  NOT  BE  RETURNED  UNLESS  RETURN  IS 
REQUIRED  BY  SECURITY  CONSIDERATIONS,  CONTRACTUAL  OBLIGATIONS,  OR 
NOTICE  ON  A  SPECIFIC  DOCUMENT. 


REPORT  DOCUMENTATION  PAGE 




1. AGENCY USE ONLY (Leave blank)

4. TITLE AND SUBTITLE

6. AUTHOR(S)

DAN WOODS
JACK PARK

3. REPORT TYPE AND DATES COVERED

FINAL 12/01/91-12/01/93

5. FUNDING NUMBERS

C F33615-92-C-5802
PE 65502
PR 3005
TA 05
WU 17

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

THINKALONG SOFTWARE, INCORPORATED
16740 WILLOW GLEN ROAD
BROWNSVILLE CA 95919

8. PERFORMING ORGANIZATION REPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

MATERIALS DIRECTORATE
WRIGHT LABORATORY
AIR FORCE MATERIEL COMMAND
WRIGHT PATTERSON AFB OH 45433-7734

10. SPONSORING/MONITORING AGENCY REPORT NUMBER

WL-TR-94-4008

11. SUPPLEMENTARY NOTES

12a. DISTRIBUTION/AVAILABILITY STATEMENT

APPROVED FOR PUBLIC RELEASE; DISTRIBUTION IS UNLIMITED.

12b. DISTRIBUTION CODE

This has been a three-part project: (1) enhancing generalized discovery tools on
our existing computer platform, (2) building specialized discovery tools, and
(3) applying generalized and specialized discovery tools to specific materials-related
projects including protein studies and crystallography.

During the work, we have had the opportunity to incorporate some of our other
activities, including studies we have done in biomedical fields using our discovery
system. We have also folded in some of our external work in design. Portions of
this other work are included in this report because they serve to illustrate key
points we wish to make. This report includes portions of contributions made by our
consultants, W.B. Dress and A.G. Jackson.

We begin our report with a review of the philosophical aspects of what we call
computational theory formation. We then present the computer tool that has been
extended as part of this work, the goal of which is to address the issues brought
up in the first section. Finally, the three different applications mentioned above
are introduced.


14. SUBJECT TERMS

ARTIFICIAL INTELLIGENCE, DISCOVERY SYSTEMS,
POLYMER COMPOSITE CURING, PROTEIN STRUCTURES

15. NUMBER OF PAGES

108

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT

UNCLASSIFIED

NSN 7540-01-280-5500

20. LIMITATION OF ABSTRACT

Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
298-102


Table  of  Contents 


Chapter 1  Objectives

1.1  Overview
1.2  Discovery
1.2.1  Philosophical Background: What is Science?
1.2.1.1  Popper’s Three Worlds
1.2.1.2  Dynamical Systems View
1.2.1.3  Doing Science: The Modeling Relation
1.2.1.3.1  Theoretical and Empirical Models
1.2.1.3.2  Computational Models
1.2.1.4  The Simulated Laboratory
1.2.2  The Scientist ↔ Model Relation
1.2.2.1  Modeling The Scientist
1.2.2.2  Modeling The Model
1.3  Tools
1.3.1  Qualitative Modeling
1.3.2  TSC's Exploratory Behaviors
1.3.2.1  Case-based Reasoning and Analogy
1.3.2.2  Directed Evolution
1.3.3  Rough Sets
1.3.4  Nearest Neighbor Analysis
1.4  Applications — Proteins
1.4.1  Prediction
1.4.2  Design
1.5  Applications — Crystallography

Chapter 2  Tool Building

2.1  The Scholar's Companion
2.1.1  Architectural Overview
2.1.2  The TSC Language: Statements
2.1.3  The TSC Knowledge Base Structure: Taxonomy
2.1.4  The TSC Knowledge: Rules
2.2  Model Building
2.3  Encyclopedia Behaviors
2.4  Exploratory Behaviors
2.4.1  Case-Based Reasoning
2.4.1.1  CBR on TSC
2.4.1.2  TSC Case-Based Design Approach vs. Conventional AI CBR
2.4.1.2.1  Similarities
2.4.1.2.2  Differences
2.4.2  Hypothesis Formation
2.4.3  Design
2.4.4  Directed Evolution
2.4.4.1  Directed Evolution vs. Genetic Algorithms
2.4.4.2  Algorithm
2.4.4.3  Optimization Strategy
2.4.4.4
2.4.4.5  Mutation
2.4.5  Genetic Programming
2.4.5.1  Algorithm
2.4.5.2  Survival and Reproduction
2.4.5.2.1  Survival
2.4.5.2.2  Program Reproduction
2.4.5.3  Crossover and Mutation
2.4.5.3.1  Program Crossover
2.4.5.3.2  Program Mutation
2.5  Data Evaluation Tools
2.5.1  Rough Set Evaluation of Genetic Program Fitness
2.5.1.1  Definitions
2.5.1.2  Evaluating Attributes
2.5.2  Nearest-Neighbor Pattern Recognition

Chapter 3  Applications

3.1  Protein Structure Prediction
3.1.1  Methodology
3.1.2  Genetic If-Then Rule Generation
3.1.3  Prediction Program Generation
3.1.4  Testing Recall and Prediction
3.2  Protein Design
3.2.1  Case-Based Protein Design
3.2.2  Approach
3.2.3  Case-Based Design Algorithm
3.3  TEM Control
3.3.1  TEM Control Approach
3.3.2  The TSC Combined Analysis Controller

Chapter 4  Results

4.1  Tools
4.1.1  Discovery Tools
4.1.2  Design
4.2  Applications
4.2.1  Proteins
4.2.1.1  Prediction by GA Rule Building
4.2.1.2  Prediction by GA Program Building
4.2.1.3  Prediction by Nearest Neighbor and Protein Design
4.3  TEM
4.3.1  Launching the TEM Controller/Simulator
4.3.2  Launching TSC
4.3.3  Initialize Simulator
4.3.4  User Dialog With the TEM Simulator

Chapter 5  Summary and Conclusions

5.1  General Improvements
5.2  Prediction System Improvements

Acknowledgments

References

Appendices

A  "Nearest Neighbor" code
B  "Rough Sets" code
C  "Evolution" code


Chapter  1 


Objectives 


1.1  Overview 

We discuss our ongoing experience in developing tools for research in materials science
and engineering. Measurable progress has been made during this SBIR Phase II activity,
but much remains to be accomplished before we can say that a “complete” set of materials
discovery tools has been built. As the discussion will show, we have looked at
several important aspects of materials. Key is that we did not restrict our tool building
to structural materials; we branched out to explore biological materials as well. We
also looked into aspects of design, with the design of molecules as a target. We see the tools we are
building as appropriate to the full range of materials science and engineering activities, from
concept discovery through product manufacturing and support.

This  has  been  and  continues  to  be  a  three-part  project:  (1)  enhancing  generalized  discovery  tools 
on  our  existing  computer  platform,  (2)  building  specialized  discovery  tools,  and  (3)  applying 
generalized  and  specialized  discovery  tools  to  specific  materials-related  projects  including  protein 
studies  and  crystallography. 

During  the  work,  we  have  had  the  opportunity  to  incorporate  some  of  our  other  activities, 
including  studies  we  have  done  in  biomedical  fields  using  our  discovery  system.  We  have  also 
folded  in  some  of  our  external  work  in  design.  Portions  of  this  other  work  are  included  in  this 
report because they serve to illustrate key points we wish to make. This report includes portions of
contributions  made  by  our  consultants,  W.B.  Dress  and  A.G.  Jackson. 

We  begin  our  report  with  a  review  of  the  philosophical  aspects  of  what  we  call  computational 
theory  formation.  We  then  present  the  computer  tool  that  has  been  extended  as  part  of  this  work, 
the  goal  of  which  is  to  address  the  issues  brought  up  in  the  first  section.  Finally,  the  three 
different applications mentioned above are introduced.

1.2  Discovery 

“There  is  nothing  more  practical  than  a  good  theory.” 

Hilbert 

1.2.1 Philosophical Background: What is Science?

The  ideas  we  wish  to  explore  and  the  paths  we  wish  to  follow  concern  the  world  of  scientific 
discovery  and  how  such  high-level  human  activities  might  be  first  simulated  and  then  aided  by 
computer-based  tools.  The  goal  is  a  tool-building  one,  having  immediate  utility  for  selected 
research  programs,  and  long-term  benefit  for  general  scientific  activities.  This  may  be  viewed  as 
a  next-generation  “intelligence  amplifier”  which  Ashby  hinted  at  nearly  forty  years  ago  [Ashby, 
1960].

To present our case clearly, we need to say a few words about science, the task and tools of
science, and its relationship to the vast intellectual structure that makes up its formal tools and
colors how we perceive and think about the world. For example, is the study of molecular
structure  really  the  study  of  molecules?  Or  is  it  rather  the  study  of  the  recorded  behavior  of 
molecules  as  reported  in  the  literature?  Or  is  it  the  study  of  methods  successfully  used  to  study 
molecules?  What  level  of  abstraction  do  we  deal  with?  When  are  we  doing  science  and  when  are 
we  studying  how  science  is  done? 

These are not meaningless questions, as their answers will determine how we go about finding
new  theories,  how  we  go  about  verifying  them,  and  how  ensuing  predictions  relate  to  reality  and 
hence  new  ideas,  methods,  and  products. 

1.2.1.1  Popper’s  Three  Worlds 

As  a  point  of  departure,  imagine  that  reality  consists  of  three  distinct  worlds  as  illustrated  in 
figure  1-1.  Following  Karl  Popper  [Popper  and  Eccles,  1977],  we  take  these  worlds  to  be 

1. The world of entities having existence: naturally occurring entities and human artifacts. Examples
include molecules, automobiles, animals, and people.

2. The world of mental states and activities not directly perceptible as entities except by ourselves;
yet we know that they must exist.

3. The world of mental constructs (whether “correct” or not). Here, we find political theories, plans of

Figure 1-1: Possible relationships among Popper’s three worlds.

Note that this scheme is a convenience for our discourse, and not meant to be a philosophical
position  to  be  debated  or  defended.  It  merely  serves  as  a  point  of  departure.  We  simply  wish  to 
talk  about  reality,  its  reflection  in  the  modeling  relation,  and  the  several  levels  of  models  needed 
to  describe  that  reality.  Thus,  the  view  of  science  as  an  activity  of  World  1  entities  building 
World  3  constructs  by  means  of  World  2  activities  for  elucidating  World  1  relationships  is  the 
real  subject  of  our  presentation. 

We  are  not  going  to  discuss  the  couplings  between  the  worlds,  except  to  note  that  they  must 
exist.

1.2.1.2 Dynamical Systems View

We  will  assume  the  sufficiency  of  the  Newtonian  Paradigm  that  generalized  forces,  coordinates, 
and  velocities  are  all  that  are  necessary  to  describe  the  universe  to  any  degree  of  accuracy.  The 
higher  derivatives  are  not  needed.  Thus,  the  model  is 

\[ \dot{x} = f[x, a, t] \]

where x represents the generalized coordinates (i.e., coordinates and velocities in the particle
description), a represents any parameters (e.g., masses), t represents the time parameter, and the dot
denotes the time derivative. The generalized coordinates are observables of the system, and their range
comprises the state space of the system. f is a vector of real-valued functions on the state space,
and any other observable of the system is represented by some f_i. This is the most general
dynamical system that we will consider; in fact, this viewpoint may be said to comprise what we
mean by “science.” Note that all of physics (even quantum mechanics), chemistry, and much of
biology fit into this scheme. The proof is the unparalleled success of the scientific method over
the last three centuries and the explosion in activity of the latter half of this one.

The  study  of  dynamical  systems  then  becomes  a  study  of  points  and  trajectories  in  the  state 
space.  Topology  may  also  be  studied  when  ensembles  of  state-space  points  are  important  (as  in 
statistical  mechanics). 

Any  of  the  standard  models  of  science  can  be  cast  in  the  dynamical  systems  formalism,  and  the 
difference-equation representation can serve as a computationally efficient model of the formal
dynamical  system.  We  will  call  on  this  form  of  dynamical  systems  modeling  exclusively  when 
we  explore  the  construction  of  models  from  collections  of  finite-state  automata. 
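
As a minimal sketch of the difference-equation representation mentioned above, the continuous model can be stepped forward with a forward-Euler update, x[n+1] = x[n] + dt * f(x[n], a, t[n]). The function names, the choice of Euler stepping, and the spring-mass example are illustrative assumptions and are not taken from TSC.

```python
from typing import Callable, List, Sequence

State = List[float]

def step(f: Callable[[Sequence[float], dict, float], State],
         x: Sequence[float], a: dict, t: float, dt: float) -> State:
    """One forward-Euler step of the model dx/dt = f(x, a, t)."""
    dx = f(x, a, t)
    return [xi + dt * dxi for xi, dxi in zip(x, dx)]

def simulate(f, x0, a, t0, dt, n_steps):
    """Return the trajectory [x0, x1, ..., xn] through state space."""
    trajectory = [list(x0)]
    x, t = list(x0), t0
    for _ in range(n_steps):
        x = step(f, x, a, t, dt)
        t += dt
        trajectory.append(x)
    return trajectory

def oscillator(x, a, t):
    """Hypothetical example system: a mass on a spring, x = [position, velocity]."""
    position, velocity = x
    return [velocity, -(a["k"] / a["m"]) * position]

# Parameters a = {k, m} and the initial state are made up for illustration.
path = simulate(oscillator, [1.0, 0.0], {"k": 1.0, "m": 1.0}, 0.0, 0.01, 100)
```

Each entry of `path` is one point in the state space, so the list as a whole is the trajectory that the text describes.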


1.2.1.3 Doing Science: The Modeling Relation

The main activity of science is the search for correct and useful models of natural systems. Once
a  model  has  been  established,  it  becomes  a  tool  for  further  scientific  exploration  and  discovery. 
Along  the  way,  any  given  model  will  be  revised,  refined,  and  perhaps  discarded.  This  is  the 
normal  progression  of  science. 


As  depicted  in  figure  1-2,  a  natural  system  is  encoded  into  a  formal  system  (the  model)  by  means 
of  observations  and  measurements.  Inferences  are  made  on  the  model  using  the  formal  rules 
(e.g.,  mathematics).  Results  are  then  decoded  into  predictions  on  the  natural  system. 


[Figure 1-2 diagram labels: Natural System, encoding, Formal System, rules of inference, decoding]

Figure 1-2: Schematic of the general modeling relation (after Rosen).

If  we  wish  to  study  science  with  the  goal  of  providing  a  set  of  advanced  simulation  tools  for 
aiding  the  scientist  in  his  quest,  we  must  first  study  and  understand  the  modeling  relation.  We 
will  show  later  that  this  study  must  be  expanded  to  include  a  model  of  the  scientist  and  his/her 
activities  as  well  as  a  model  of  the  scientific  theories  and  process. 

1.2.1.3.1  Theoretical  and  Empirical  Models 

The  standard  model  of  science  is  theoretical  in  the  sense  that  the  model  builder  relies  on  a 
well-known  theory  of  measurement  and  observation  and  uses  standard  mathematical  techniques 
to  describe  and  codify  the  results  of  the  measurements.  The  next  stages  are  likewise  theoretically 
based  as  the  scientist  attempts  to  construct  a  deductive  system  whose  goal  is  to  provide  predictions 
of  behaviors  in  World  1. 

Empirical  models  may  result  during  the  course  of  a  laboratory  experiment  wherein  certain  parts 
and  functions  of  the  natural  system  under  study  are  replaced  by  contrivances  or  held  fixed  while 
other parts are observed and measured under controlled constraints. This tight interaction between
theory  and  experiment  is  where  most  of  today’s  science  takes  place. 

1.2.1.3.2 Computational Models

It is only recently that a third type of model became possible. A computational model is a
novel mixture of the theoretical and empirical models. It is theoretical in that it is based on the
formal notions of mathematics and logic, and empirical in that the computation must actually be
carried out in many cases. Here, we must distinguish computation that simulates a system under
study from calculation that obtains, for example, a solution to a differential equation. The results
of calculations were used for centuries before the invention of the computer, but complex
simulations of certain systems need the computer to become feasible. After all, an answer in closed
form or as a numerical function is the exception. Certain simulations are therefore beyond the
reach of the “standard” modeling techniques of differential equations and can only be carried out
on a suitable simulator (e.g., a digital computer).

An example of a computational model that rarely has a solution obtainable by calculation in the
usual sense is provided by Boolean switching networks. Theory shows that limit cycles are to be
expected, but which limit cycles arise under which initial conditions can only be determined by a
computational model once the network surpasses a certain complexity. Such switching
networks can provide a model of genetic behavior during organism growth (the differentiation
process). The operon model, in particular, is amenable to the switching-network description.
Obtaining answers to questions of evolutionary behavior requires that the system be simulated, since
no known shortcut exists. This point has been formally discussed by Wolfram [1988].
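
The point about limit cycles can be illustrated with a small sketch: a random Boolean switching network is simply run forward until a state repeats, which is the only general way to learn which cycle a given initial condition falls into. The network construction (n nodes, k inputs each, random truth tables, synchronous update) and all names here are assumptions chosen for illustration, not a description of any model used in this report.

```python
import random
from typing import Dict, Tuple

def make_network(n: int, k: int, seed: int):
    """Build a random Boolean network: each node reads k inputs through a random truth table."""
    rng = random.Random(seed)
    inputs = [[rng.randrange(n) for _ in range(k)] for _ in range(n)]
    tables = [[rng.randrange(2) for _ in range(2 ** k)] for _ in range(n)]
    return inputs, tables

def update(state: Tuple[int, ...], inputs, tables) -> Tuple[int, ...]:
    """Synchronously update every node from the current values of its inputs."""
    nxt = []
    for node_inputs, table in zip(inputs, tables):
        index = 0
        for src in node_inputs:
            index = (index << 1) | state[src]
        nxt.append(table[index])
    return tuple(nxt)

def find_limit_cycle(state, inputs, tables):
    """Iterate until a state repeats; return (transient_length, cycle_length)."""
    seen: Dict[Tuple[int, ...], int] = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = update(state, inputs, tables)
        t += 1
    return seen[state], t - seen[state]

inputs, tables = make_network(n=8, k=2, seed=1)
transient, period = find_limit_cycle((0,) * 8, inputs, tables)
```

Because the state space is finite (2^n states), the loop always terminates, but the transient and period are found only by carrying out the computation.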

1.2.1.4 The Simulated Laboratory

Doing  science  by  computer  must  ultimately  and  intimately  involve  an  integrated  cybernetic 
system  designing  and  conducting  actual  experiments  in  a  real-world  (World  1)  laboratory.  The 
system  will  be  an  extension  of  the  human  scientist— a  cybernetic  “graduate  student”  with  instant 
access  to  sophisticated  scientific  databases,  a  suite  of  standard  scientific  methods,  and  a  repertoire 
of  laboratory  techniques.  Such  a  process  requires  a  sophisticated  set  of  effectors  and  affectors 
that  we  do  not  yet  possess.  The  next  best  thing  is  to  provide  a  simulated  laboratory. 

1.2.2 The Scientist ↔ Model Relation

Since we effectively have all three worlds at our disposal, we can look at and think about the
relationship between the scientist and his world of study. This is exactly the system we need to
study and model to achieve our goal of building a tool for doing science by computer. To carry
out such an ambitious program, we will need to model the scientist (no mean task) and the
systems under study (much easier, but certainly nontrivial).

To model the systems under study requires that we obtain effective models of scientific
models  of  natural  systems.  We  are  necessarily  one  step  removed  from  the  realm  of  the  natural  or 
physical  scientist  and  must  remain  aware  of  that  fact  in  all  that  follows.  To  forget  will  cause  us  to 
confuse  our  models  and  constructs  with  first-order  scientific  theories,  when  they  are  theories  of 
theories  and  theories  of  behavior.  In  effect,  what  we  are  doing  is  studying  relationships  that  map 
World  3  onto  World  3,  whereas  the  (first-order)  scientist  is  studying  relationships  mapping 
World  1  onto  World  1  by  means  of  constructs  in  World  3.  This  activity  is  illustrated  in  figure 
1-3,  which  shows  how  the  scientist  builds  entities  in  World  3  from  observations  (World  2 
entities)  of  World  1  behaviors. 

Figure 1-3: Schematic representation of the entities and relationships involved in the discovery process.


The only sure guides we have in this undertaking are the successful models constructed by first-order
science. These must be our touchstone for the utility of what we wish to accomplish.


1.2.2.1  Modeling  The  Scientist 

This difficult and fascinating subject is at the core of the following discussion of the discovery
system tools we call The Scholar’s Companion (TSC): how to model scientific discovery. Note
that numerous authors have been exploring such ideas over the last decade. Notable among such
undertakings are studies in qualitative physics [Bobrow, 1985] and simulation of scientific discovery
on specially crafted data sets [Langley, 1987; Thagard, 1988]. The goal of TSC is to provide the
user  with  a  functional  model  of  the  scientist — a  scientist-computer  that  can  make  hypotheses  and 
perform  useful  “computer”  work  such  as  database  manipulation  and  certain  calculations,  as  well 
as  constructing  and  testing  reasonable  models  germane  to  the  problems  posed  by  the  human 
scientist. 

Figure 1-4: Abstraction of entities and relationships comprising the scientist and scientific activities.

Figure  1-4  represents  that  portion  of  the  scientific  discovery  process  that  is  to  be  modeled  by 
TSC.  This  shows  what  The  Scholar’s  Companion  needs  to  model  at  its  core.  (The  left  dashed 
arrow  represents  some  portion  of  scientific  observations  required  to  construct  a  theory  or  scientific 
model.  Note  that  World  1  entities  such  as  the  scientist  himself  or  the  actual  systems  under  study 
are  not  required  for  this  abstraction.)  Previous  work,  cited  above,  has  concentrated  only  on  the 
encoding-decoding  relationships  (labeled  “Theory”  and  “Prediction”  in  the  figure),  and  that  on 
rather  limited  and  contrived  data  sets.  Our  discussion  in  this  presentation,  based  on  the  mappings 
among  the  three  worlds,  clearly  shows  that  additional  modeling  is  needed.  Not  only  must  some 
of  the  World  2  entities  be  taken  into  account,  but  the  interfaces  between  World  1  and  2  and 
between  World  2  and  3  are  essential  to  any  successful  machine  representation  of  scientific 
activity.  Classic  artificial  intelligence  (AI)  has  had  some  success  in  modeling  perceptions  and 
thought  processes;  much  of  the  work  discussed  in  Langley  [1987]  involves  these  constructs.  The 
lower-level  World  2  entities  and  the  World  1 — World  2  interface  remain  largely  unexplored 
areas.  As  yet,  no  comprehensive  system  as  depicted  in  figure  1-4  has  been  attempted. 

1.2.2.2 Modeling The Model

Our  concern  in  the  remainder  of  this  section  will  be  with  the  computational  models  representing 
the  natural  system  under  study.  We  will  need  to  look  at  the  interface  between  these  systems  and 
the  observing  system,  as  well  as  possible  formalisms  for  constructing  the  model  systems.  Figure 
1-5  illustrates  the  relationships  between  the  scientific  discovery  system  and  the  scientist’s  model 
of  the  natural  system.  In  the  figure,  the  scientific  theory  is  represented  in  the  computational 
system  by  the  Computational  Model  module  and  interacts  with  it  by  models  of  the  encoding  and 
decoding  processes. 

[Figure 1-5 diagram label: TSC Interface]


Figure  1-5:  A  modeling  view  of  a  computer-based  system  for  doing  science. 


The  relationships  among  the  model,  the  theory,  and  the  user  are  manifested  in  the  interface 
between  certain  computer  programs.  This  interface  primarily  consists  of  “measures  of  performance” 
of the computational model and interpretations of these measures, analogous to the affectors
discussed above. An example would be a set of numbers returned to TSC representing a concentration
of a certain molecular species that was the subject of the requested model. The effectors
consist  of  a  grammar  allowing  TSC  to  specify  types  of  models  to  be  created  in  the  Computational 
Model  module.  This  interface  is  shown  in  figure  1-5  as  the  “TSC  Interface.” 

A realization of these ideas is idealized in figure 1-6. Our Phase II project has been to develop tools


Figure  1-6:  A  realization  of  a  toolbox  for  scientific  and  engineering  discovery. 


1.3  Tools 

We  implement  the  philosophical  notions  discussed  above  in  a  toolbox,  introduced  above,  called 
The Scholar’s Companion (TSC). TSC is being developed to provide scientists and engineers
with  modeling,  simulation,  control,  and  discovery  tools. 


1.3.1  Qualitative  Modeling 

Tool building has been the main thrust of this Phase II activity. Most important has been the refinement
of our qualitative modeling tools, called QPD [Wood and Park, 1950]. Modeling, as we shall see,
forms the backbone of most of the materials discovery activities we pursue. We call the internal
structure of a qualitative model an envisionment.

A simple envisionment (discussed in detail in the next chapter) from the polymer curing domain
is illustrated in figure 1-7. The initial conditions are that a polymer is contained in an
autoclave  and  heat  is  applied  by  the  autoclave  heater.  The  polymer  eventually  begins  its  condensation 
curing  reaction  at  which  point  the  IF-THEN  rules  suggest  envisioning  two  different  outcomes  of 
the  process:  a  cured  component  and  a  burnt  component. 


Figure 1-7: A simple envisionment.

Model  building  provides  the  opportunity  to  explore  the  possible  outcomes  of  a  physical  process 
given  some  initial  conditions.  Process  control  discovery  then  involves  finding  a  process  algorithm 
which  encourages  the  desirable  outcome  and  discourages  the  undesirable  outcome. 
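
To make the idea concrete, an envisionment can be sketched as a directed graph of qualitative states that IF-THEN rules expand from the initial conditions. The state names, the rule representation as Python functions, and the graph layout below are hypothetical and are not the QPD data structures, which are described in the next chapter.

```python
from typing import Callable, Dict, List

# Each rule: IF the condition holds for a qualitative state THEN these successors are possible.
Rule = Callable[[str], List[str]]

def heating_rule(state: str) -> List[str]:
    return ["polymer_curing"] if state == "polymer_heating" else []

def curing_rule(state: str) -> List[str]:
    # Two outcomes are envisioned once the condensation reaction has started.
    return ["cured_component", "burnt_component"] if state == "polymer_curing" else []

def envision(initial: str, rules: List[Rule]) -> Dict[str, List[str]]:
    """Expand every reachable state once; return the graph state -> successors."""
    graph: Dict[str, List[str]] = {}
    frontier = [initial]
    while frontier:
        state = frontier.pop()
        if state in graph:
            continue
        successors = [s for rule in rules for s in rule(state)]
        graph[state] = successors
        frontier.extend(successors)
    return graph

def outcomes(graph: Dict[str, List[str]]) -> List[str]:
    """Terminal states: those with no successors."""
    return sorted(s for s, nxt in graph.items() if not nxt)

graph = envision("polymer_heating", [heating_rule, curing_rule])
```

In this toy graph, process control discovery would amount to finding conditions under which the path to `cured_component` is taken and the path to `burnt_component` is avoided.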

1.3.2  TSC’s  Exploratory  Behaviors 

We are able to take advantage of the model-building behaviors by asking “what if” questions.
Such questions are posed in goal-oriented statements we give TSC. There are two exploratory
behaviors that we are developing and discuss here: case-based reasoning and analogy,
and directed evolution.

1.3.2.1 Case-based Reasoning and Analogy

Case-based reasoning involves searching a library database for cases similar to the case represented
by  the  qualitative  model  TSC  is  presently  considering.  This  search  may  find  one  or  several  cases 
in  which  portions  of  a  case  may,  by  analogy,  be  found  to  apply  to  the  model  presently  being 
considered. 
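
A minimal sketch of the retrieval step follows, assuming each case is reduced to a set of qualitative features and similarity is measured by Jaccard overlap. Both the feature-set representation and the Jaccard measure are stand-ins chosen for illustration; the report does not state how TSC scores similarity, and the case library below is invented.

```python
from typing import Dict, List, Set, Tuple

def similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two feature sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def retrieve(query: Set[str], library: Dict[str, Set[str]],
             threshold: float = 0.3) -> List[Tuple[str, float, Set[str]]]:
    """Return (case name, score, shared features) for cases above the threshold, best first."""
    hits = []
    for name, features in library.items():
        score = similarity(query, features)
        if score >= threshold:
            hits.append((name, score, query & features))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical case library and query model.
library = {
    "epoxy_cure": {"autoclave", "heat_applied", "condensation_reaction"},
    "metal_anneal": {"furnace", "heat_applied", "slow_cooling"},
    "protein_fold": {"amino_acid_sequence", "helix"},
}
query = {"autoclave", "heat_applied", "polymer"}
matches = retrieve(query, library)
```

The shared features returned with each hit are the portions of the stored case that might, by analogy, carry over to the model under consideration.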

1.3.2.2  Directed  Evolution 

The primary tool of discovery in our work is an exploratory machine learning technique we call
directed evolution. We have applied directed evolution in three different ways during this work:
(1) genetic algorithm mutation to IF-THEN rules used to predict protein structures, (2) genetic
algorithm mutation to Lisp-like programs being designed to predict protein structures, and (3)
heuristic mutations to design rules. The general approach to directed evolution is to select some
member(s) of a universe of rules or objects in a knowledge base, clone and mutate them, then study
the results.
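
The select, clone, mutate, and study loop can be sketched generically as follows. This is an illustration of the general pattern only, under the assumption that members are lists and that "study the results" means ranking by a fitness function; it is not the algorithm detailed in Chapter 2, and the bit-string example is made up.

```python
import copy
import random
from typing import Callable, List

def directed_evolution(population: List[list], fitness: Callable[[list], float],
                       mutate: Callable[[list, random.Random], None],
                       generations: int, n_select: int, seed: int = 0) -> List[list]:
    """Repeatedly select the fittest members, clone them, mutate the clones, and keep the best."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:n_select]
        clones = [copy.deepcopy(rng.choice(parents)) for _ in range(size)]
        for clone in clones:
            mutate(clone, rng)
        # Parents compete with their mutated clones, so the best score never decreases.
        population = sorted(parents + clones, key=fitness, reverse=True)[:size]
    return population

def flip_one_bit(member: list, rng: random.Random) -> None:
    """Hypothetical mutation operator: flip one randomly chosen bit in place."""
    i = rng.randrange(len(member))
    member[i] = 1 - member[i]

def count_ones(member: list) -> float:
    """Hypothetical fitness: the number of 1 bits."""
    return float(sum(member))

start = [[0] * 10 for _ in range(6)]
final = directed_evolution(start, count_ones, flip_one_bit, generations=40, n_select=2)
```

In the applications named above, the members would be IF-THEN rules, Lisp-like programs, or design rules rather than bit strings, with mutation operators suited to each.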

1.3.3 Rough Sets

Determination of the relative importance of pieces of information and the refinement of knowledge
into a dense, accurate representation are at the core of much AI research. A relatively new approach to
the  analysis  of  large  data  sets  is  Rough  Set  Theory  [Ziarko,  1989].  We  have  implemented  a 
version  of  Rough  Set  Theory  in  the  TSC  toolbox,  and  have  begun  to  apply  it  to  the  problem  of 
protein  structure  analysis. 
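
As a rough illustration of the idea (not the TSC implementation, which is not shown in this report), rough set analysis approximates a target set of objects from below and above, using the groups of objects that a chosen set of attributes cannot tell apart. The Python sketch below assumes a tiny hypothetical table of residues; the attribute names, object ids, and the `approximations` helper are invented for illustration.

```python
from collections import defaultdict

def approximations(table, attributes, target):
    """Return (lower, upper) rough-set approximations of `target`.

    table:      dict mapping object id -> dict of attribute values
    attributes: attribute names used to group indiscernible objects
    target:     set of object ids to approximate
    """
    # Objects with identical values on `attributes` are indiscernible.
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:      # class lies wholly inside the target
            lower |= members
        if members & target:       # class overlaps the target
            upper |= members
    return lower, upper

# Hypothetical data: four residues described by two invented attributes.
table = {
    "r1": {"hydrophobic": True, "charged": False},
    "r2": {"hydrophobic": True, "charged": False},
    "r3": {"hydrophobic": False, "charged": True},
    "r4": {"hydrophobic": False, "charged": False},
}
in_helix = {"r1", "r3"}
lower, upper = approximations(table, ["hydrophobic", "charged"], in_helix)
```

Objects in the lower approximation certainly belong to the target given the chosen attributes; objects in the upper approximation possibly belong. The gap between the two indicates how much the attributes fail to discriminate.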

1.3.4 Nearest Neighbor Analysis

The  study  of  molecule  structures  requires  the  ability  to  perform  pattern  recognition  on  sequences 
of  components  of  the  structure.  Amino  acid  sequences  are  found  in  proteins,  and  the  recognition 
of  sequences  which  result  in  different  structures  such  as  helices  and  beta  sheets  is  germane  to  the 
prediction of protein structures. We have implemented a variant of the nearest neighbor algorithm
in  the  TSC  toolbox,  and  have  tested  it  on  a  variety  of  proteins  of  known  conformation. 
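
The following Python sketch shows the general shape of nearest neighbor classification on sequence windows. It is a simplification under stated assumptions: it uses plain Hamming distance rather than whatever metric the TSC variant uses, and the example windows, labels, and function names are hypothetical.

```python
def hamming(a, b):
    """Count positions at which two equal-length windows differ."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    return sum(x != y for x, y in zip(a, b))

def nearest_label(window, examples):
    """Return the label of the stored window closest to `window`."""
    _, label = min(examples, key=lambda ex: hamming(window, ex[0]))
    return label

# Hypothetical labeled windows of five residues each.
examples = [
    ("AEELK", "helix"),
    ("VTVTV", "sheet"),
    ("GPGSG", "coil"),
]
prediction = nearest_label("AEELR", examples)
```

The query window differs from the first stored window at one position and from the others at every position, so it inherits the first window's label.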

1.4  Applications  —  Proteins 

Nearly  all  of  our  tool-building  activities  in  this  02  project  have  been  directed  specifically  at  the 
understanding  of  protein  structures.  Our  task  was  to  select  some  activity  in  the  materials  domain 
and  apply  our  tool-building  skills  to  that  activity;  our  sponsors  requested  that  we  focus  our 
efforts  on  protein  structures. 

Proteins  are  the  building  blocks  of  life  itself,  and  it  turns  out  these  tiny  molecules  have  many 
properties  that  are  interesting  and  potentially  useful  in  applications  other  than  living  tissue.  For 
example,  the  electro-optical  properties  of  specific  protein  structures  suggest  applications  in  optical 
filters. The study of these properties is of current interest to our sponsors, and the tool building of
this 02 activity supports the sponsor’s work. The overall flow of protein analysis starts with the
analysis  of  amino  acid  sequences,  predicts  the  final  protein  structure,  and  ultimately  designs 
proteins  with  specific  desired  structures.  These  structures  may  be  useful  in  the  electro-optical 
domain,  and  they  may  also  be  useful  in  the  biotechnology  domain,  as  new  disease-fighting  drugs, 
for  example. 

1.4.1  Prediction 

Prediction of the secondary structure of a new sequence is performed by any of a variety of
learning techniques. From the literature, approaches to the prediction of protein secondary structure
have included the genetic algorithm [Unger and Moult, 1993], our own approach [LeClair et al.,
1992], neural nets [Qian and Sejnowski, 1988], [Holley and Karplus, 1989], and statistical
approaches which include conformational propensity parameters [Chou and Fasman, 1978].

We have explored two different approaches to the genetic algorithm, and have developed tools to
perform studies based on an algorithm known as the nearest neighbor algorithm [Cost and
Salzberg, 1993], [Salzberg and Cost, 1992].

1.4.2  Design 

How  amino  acid  sequences  specify  a  protein’s  three-dimensional  structure  remains  unanswered 
[DeGrado et al., 1989]. One approach to gaining understanding is de novo design of model
proteins.  This  approach  has  long  been  useful  in  designing  small  molecules.  We  have  extended 
our  protein  study  tools  to  use  the  process  of  analogy  and  analysis  of  proteins  to  design  an 
experimental protein structure.

1.5 Applications — Crystallography

Synthesis  of  materials  is  a  very  old  problem,  as  is  processing  them  to  produce  a  tool  or  an  object 
of art. Although our knowledge about materials has increased tremendously in the
last century, our means for exploring the possibilities of designing materials are only in the early
stages of development.

Examples of materials problems of great interest are those associated with specific properties of
biopolymers,  semiconductors,  and  intermetallics.  Optical  properties  are  of  particular  interest  for 
polymers  and  semiconductors,  and  strength  and  ductility  are  concerns  for  intermetallics. 

The  transmission  electron  microscope  (TEM)  is  an  important  tool  associated  with  the  study  of 
materials properties. We have explored the coupling of our discovery tools to the control of
experiments with a TEM, and to automating detection of properties of materials with crystalline
structures.

Chapter 2 Tool Building

The  technical  approach  applied  in  ThinkAlong’s  research  involves  the  coupling  of  computational 
tools  to  problems  involving  exploration  of  datasets.  The  core  of  our  efforts  centers  around  an 
artificial  intelligence  tool  we  developed  in  earlier  work,  in  development  since  the  mid-1980s, 
called  The  Scholar’s  Companion  (TSC),  as  introduced  in  chapter  1.  Its  hardware  and  software 
architectures  are  discussed  in  further  detail  below. 

In  general,  TSC  is  created  to  serve  the  user  with  several  behaviors.  These  include  model  building 
by  applying  specific  knowledge  to  some  given  initial  conditions;  general  purpose  encyclopedia 
behaviors  such  as  using  internal  knowledge  to  answer  user  queries;  exploratory  behaviors  including 
directed  evolution,  genetic  programming,  hypothesis  formation,  and  assisting  the  user  in  creative 
design  tasks;  and  data  evaluation  tools  including  rough  set  analysis  and  nearest  neighbor  analysis. 

These  behaviors  are  discussed  after  the  introduction  to  TSC. 

2.1 The Scholar’s Companion

2.1.1  Architectural  Overview 

TSC is constructed as a message-based object-oriented system, the architecture of which is
illustrated in figure 2-1. The main data interaction with the system begins at the environment,
flowing through encoders to a global message list. The knowledge base interacts with the global
messages, deleting old messages and writing new ones. Some of the new messages are decoded
and returned to the environment. This flow of information is an outgrowth of the “expert system”
approach to artificial intelligence. TSC adds a variety of learning technologies to the expert
system approach.


Figure 2-1: TSC Architecture.

TSC  is  intended  to  operate  in  networked  computational  environments,  though  it  is  suitable  for 
stand-alone  desktop  application  as  well.  The  network  approach  permits  application  of  a  variety 
of simulation tools and large databases to the discovery process. This is illustrated in figure 2-2.

Figure 2-2: TSC on a network


Part of AI research centers around schemes for representing in a computer program the knowledge
we carry in our heads, i.e. World 2 models of World 1. Another part of that research looks for
algorithms which can apply the represented knowledge to some task. The following discussion of
TSC’s learning behaviors considers knowledge represented as “actors,” their relationships and
states, and behaviors in some physical or chemical process.

TSC  supports  a  knowledge  base  in  a  frame  format.  Frames  represent  concepts  which  include 
actors,  relations,  states,  rules  of  behavior  and  rules  which  represent  physical  and  chemical  processes. 
Each  frame  is  similar  to  a  small  relational  database  entry.  For  example,  the  following  frame 
(adapted  from  [Karp,  1992])  contains  much  of  the  data  on  a  particular  molecule.  It  is  read  by 
forming  “sentences”  from  its  entries.  For  example,  carbon  monoxide  belongs  to  all  kingdoms. 
We can also see from the last line that carbon monoxide is a compound.

c: carbon-monoxide
    display-coords-2d     ((-0.77 0.0) (0.77 0.0))
    structure-bonds       ((2 1 3))
    structure-atoms       (o c)
    priority              1
    mesh-ids              “D1.154.328” “D1.655.498.185” “D002248”
    kingdoms              all
    chemical-formula      ((c 1) (o 1))
    roots                 carbon
    sources               Ihcsdb mlmavro
    cas-registry-numbers  “630-08-0”
    molecular-weight      28.01
    synonym               CO
    atom-charges          ((2 -1) (1 1))
    sub.of                compound
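
To make the "reading by sentences" idea concrete, the Python sketch below stores a subset of the frame as a dictionary and enumerates one (subject, slot, value) triple per entry. The dictionary layout and the `sentences` helper are assumptions made for illustration; the report does not describe TSC's internal storage.

```python
# A subset of the carbon-monoxide frame, stored as slot -> value(s).
frame = {
    "name": "carbon-monoxide",
    "structure-atoms": ["o", "c"],
    "structure-bonds": [(2, 1, 3)],
    "kingdoms": ["all"],
    "chemical-formula": [("c", 1), ("o", 1)],
    "molecular-weight": 28.01,
    "synonym": ["CO"],
    "sub.of": ["compound"],
}

def sentences(frame):
    """Yield one (subject, slot, value) triple per slot value."""
    subject = frame["name"]
    for slot, value in frame.items():
        if slot == "name":
            continue
        # Treat a bare value as a one-element list of values.
        values = value if isinstance(value, list) else [value]
        for v in values:
            yield (subject, slot, v)

triples = list(sentences(frame))
```

Each triple corresponds to one "sentence" of the kind described above, for example ("carbon-monoxide", "kingdoms", "all") for "carbon monoxide belongs to all kingdoms".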

2.1.2  The  TSC  Language:  Statements 

TSC  writes  messages  for  use  internally,  and  for  use  in  communication  with  the  user.  The  syntax 
of  a  typical  TSC  message  is  based  on  a  sentence  with  a  subject,  predicate,  and  truth;  a  message 
can  also  involve  a  sentence  with  a  subject,  predicate,  object,  and  truth.  Variables  are  words 
which  begin  with  an  asterisk  (*).  These  sentences  are  read  as  illustrated: 

( predicate ( subject ) truth )
e.g. ( ala ( *x ) true )

( predicate ( subject object ) truth )
e.g. ( abuts ( *x *y ) true )

Actors  are  regarded  as  things  which  occupy  space  and  have  mass.  Examples  of  statements  TSC 
can process about actors include:

( thermal.mass ( body.01 ) true )
( heat.source ( autoclave.34 ) true )
( b-cell ( b-cell.01 ) true )
( temperature.sensor ( t1 ) true )

Typical statements TSC can process about relationships include:

( hotter.than ( body.01 body.02 ) true )
( hotter.than ( autoclave.32 body.01 ) true )
( abuts ( body.01 body.02 ) true )
( inside ( t1 body.01 ) true )
( binds ( b-cell.01 antigen.04 ) true )

Typical  statements  TSC  can  process  about  states  include: 

( increasing ( t1 ) true )

(  increasing  (  t2  )  false  ) 
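
The statement syntax above can be modeled in a few lines. The Python sketch below represents a statement as a predicate, a tuple of arguments, and a truth value, and matches a pattern containing asterisk variables against a concrete statement. The `Statement` type and the `match` function are hypothetical names used for illustration, not TSC's own code.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    predicate: str
    args: tuple      # (subject,) or (subject, object)
    truth: bool

def is_variable(word):
    """Variables are words which begin with an asterisk."""
    return word.startswith("*")

def match(pattern, fact, bindings=None):
    """Match `pattern` against `fact`; return variable bindings or None."""
    bindings = dict(bindings or {})
    if (pattern.predicate != fact.predicate
            or pattern.truth != fact.truth
            or len(pattern.args) != len(fact.args)):
        return None
    for p, f in zip(pattern.args, fact.args):
        if is_variable(p):
            # A variable must bind to the same value everywhere it occurs.
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings

fact = Statement("abuts", ("body.01", "body.02"), True)
pattern = Statement("abuts", ("*x", "*y"), True)
result = match(pattern, fact)
```

Here `result` binds `*x` to `body.01` and `*y` to `body.02`; a pattern that reuses one variable for both positions would fail against this fact.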


2.1.3 The TSC Knowledge Base Structure: Taxonomy


The entirety of the knowledge base is stored in the form of taxonomies. A taxonomy contains
information on a group of actors, the relationships they form with each other, and the states in
which they may be detected. A fragment of a taxonomic structure for the domain of immunology is
presented in figure 2-3. Here we illustrate structures related to a particular actor in immunology,
the B-Cell. This cell may be detected as one of several subspecies, those behaving as activated,
and those behaving as natural killer cells. The B-Cell actor is mentioned in several process rules
and statements (as listed below), so the taxonomy includes this information.

[Figure 2-3: taxonomy diagram. Legible node labels include B-CELL, B-CELL.ACTIVATION.1, ACTIVATED.HELPER.T-CELL.1, ANTIGEN.BINDING.B, ANTIGEN.INTERNALIZATION, and ANTIGEN.PROCESSING.B.]

Figure  2-3.  A  fragment  of  the  B-Cell  taxonomy. 


2.1.4  The  TSC  Knowledge:  Rules 

Physical and chemical processes (e.g. nucleation, evaporation, etc.) as well as TSC behaviors
(discussed below, e.g. prediction, design, etc.) are represented as IF-THEN rules. A typical
structural prediction process rule from our protein study exercise is illustrated here:
c: OBS.1
    worth 680
    IF actors are ALA and GLU
    AND ALA abuts GLU
    THEN predict helical structure

The rule above is illustrated in an English-like format for readability.

A typical process rule from a biomedical domain (immunology) looks something like:

C: ANTIGEN.BINDING.B
    LEVEL             BASIC
    SUB.OF            PHYS.PROCESS
    INSTANCE.OF       RULE
    CONTEXT           HUMAN.IMMUNOLOGY
    IF.ACTORS         ( ( ANTIGEN ( *ANTIGEN ) TRUE )
                        ( B-CELL ( *B-CELL ) TRUE ) )
    IF.RELATES        ( ( ABUTS ( *B-CELL *ANTIGEN ) TRUE ) )
    IF.NOT.RELATES    ( ( BINDS ( *B-CELL *ANTIGEN ) TRUE ) )
    THEN.RELATES      ( ( BINDS ( *B-CELL *ANTIGEN ) TRUE ) )


This  rule  says  that  if  you  have  a  B-Cell  abutting  Antigen,  the  B-Cell  then  binds  the  Antigen. 
Firing  such  rules  is  the  process  by  which  TSC  builds  a  qualitative  model  of  a  process.  We  now 
turn  our  discussion  to  these  models. 


2.2  Model  Building 

Knowledge bases, as discussed earlier, include IF-THEN rules which describe physical processes,
and the taxonomic knowledge base which includes information on the actors, their relationships,
and their states. These knowledge base entries are then applied to the construction of a qualitative
model of some aspect of the domain represented by the knowledge base. The specific aspect of
the domain is constrained by an entry supplied by the user, the initial conditions. TSC “fires”
IF-THEN  rules  which  the  initial  conditions  enable,  building  an  envisionment.  A  stylized 
envisionment  for  the  immunology  knowledge  base  looks  something  like  that  of  figure  2-4. 


Figure 2-4: Immune system envisionment.

An  envisionment  is  “grown”  by  TSC  when  some  initial  conditions  are  provided.  Initial  conditions 
represent  statements  about  the  initial  actors,  their  initial  relationships,  and  their  starting  states.  A 
set  of  initial  conditions  may  be  provided  by  the  user,  or  it  may  be  supplied  by  TSC  as  it  generates 
its  own  experiments.  The  notion  of  “growing”  an  envisionment  follows  on  the  observation  that 
the  envisionment  is  a  tree-structured  directed  graph,  also  called  a  digraph. 

Directed  graphs  consist  of  nodes  connected  by  arcs,  which  are  represented  by  the  arrows  in  the 
figure. In our parlance, each node is called an episode, and each episode represents the envisioned
new  state  of  affairs  in  a  process.  Each  episode  is  the  result  of  a  single  process  rule  firing.  Notice 
that  alternate  branches  of  the  tree  are  formed,  as  illustrated  in  figure  2-4.  Alternate  branches 
mean that more than one process may occur given the same conditions.

Qualitative models in TSC provide the structure of theories that TSC uses to explain and to
predict  the  outcomes  of  physical  processes  on  given  initial  conditions.  We  illustrate  this  by 
discussion  of  the  reasoning  process  called  abduction,  in  which  hypotheses  may  be  formed  from 
the  qualitative  model  to  explain  some  observation. 

If  a  small  digraph  made  from  the  observation  of  a  physical  process  can  be  “matched”  to  an 
envisionment, we can use abduction to hypothesize that the processes of the envisionment provide
a  mechanism  to  explain  the  causal  relationships.  For  example,  the  digraph  in  figure  2-5  can  be 
matched  to  a  path  in  the  envisionment  in  figure  2-4;  thus  we  have  reason  to  suspect  that  figure 
2-4  describes  a  mechanism  explaining  figure  2-5. 


Figure  2-5:  Causal  Digraph  Example 

Abduction  in  the  context  of  TSC  is  a  process  of  logic  which  relates  consequences  to  their 
candidate  antecedents.  That  is,  if  we  know  that  some  consequent  B  follows  from  antecedent  A, 
and  we  observe  B,  then  we  can  hypothesize  that  A  is  true.  Specifically,  applying  abductive 
inference,  if  we  know  that  introduction  of  antigen  into  the  immune  system  will  lead  to  antibody 
production via a mechanism that includes cytokine production (as in figure 2-4), and we “observe”
causal  relationships  between  introduction  of  antigen,  increasing  cytokine,  and  increasing  antibodies 
(as  in  figure  2-5),  then  we  hypothesize  that  the  observed  causality  is  explained  by  the  known 
mechanism. 
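
One simple way to picture this matching step is to ask whether the observed causal chain appears, in order, along some path of the envisionment. The Python sketch below does this for a tree stored as a dictionary of children. It is an assumption-laden illustration: the episode names are hypothetical, observations are reduced to a linear chain, and TSC's actual matching procedure is not described in this report.

```python
def paths(tree, node, prefix=()):
    """Yield every path from `node` down to a leaf, as a tuple of episodes."""
    prefix = prefix + (node,)
    children = tree.get(node, [])
    if not children:
        yield prefix
    for child in children:
        yield from paths(tree, child, prefix)

def is_subsequence(observed, path):
    """True if `observed` occurs in `path` in order (gaps allowed)."""
    remaining = iter(path)
    return all(step in remaining for step in observed)

def explain(tree, root, observed):
    """Return the first path accounting for the observations, or None."""
    for path in paths(tree, root):
        if is_subsequence(observed, path):
            return path
    return None

# Hypothetical envisionment with one alternate branch at the root.
envisionment = {
    "antigen.introduced": ["antigen.bound", "antigen.cleared"],
    "antigen.bound": ["cytokine.increasing"],
    "cytokine.increasing": ["antibody.increasing"],
}
observed = ["antigen.introduced", "cytokine.increasing", "antibody.increasing"]
mechanism = explain(envisionment, "antigen.introduced", observed)
```

When `explain` returns a path, that path is the hypothesized mechanism; when it returns None, no known path accounts for the observations.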

Applying  this  reasoning  in  a  situation  where  the  digraph  of  observations  is  not  found  to  match 
the envisionment results in an important event in the use of TSC: an expectation failure is said to
occur.  Since  the  envisionment  both  explains  and  predicts,  any  time  it  is  unable  to  explain  or 
predict  observations,  the  resulting  expectation  failure  causes  TSC  to  begin  a  set  of  tasks  to  deal 
with  this  new  event. 

Dealing with expectation failures evokes several behaviors in TSC. The initial behavior is to
alert  the  user,  and  describe  the  nature  of  the  expectation  failure.  Initial  TSC  behaviors  are  then 
guided  by  a  diagnostic  knowledge  base  which  tries  to  determine  if  the  expectation  failure  is 
caused  by,  for  example,  a  sensor  failure.  More  advanced  behaviors  involve  TSC’s  exploration 
tools  which  allow  it  to  conjecture  the  presence  of  an  unknown  (to  the  program)  process  which 
may  be  involved  in  the  observations. 

Model  building  involves  specification  of  the  observables.  These  form  the  initial  conditions  of  an 
envisionment. The envisionment then characterizes the linkages between the observables, which
TSC  builds  by  firing  process  rules.  Rule  firing  involves  matching  the  “IF-side”  of  an  IF-THEN 
rule  to  the  current  episode  (the  first  one  being  the  initial  conditions).  When  a  match  is  found,  the 
“THEN-side”  of  the  rule  is  used  to  build  a  new  episode.  The  resulting  envisionment  is  an 
expression  of  the  model. 
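
The rule-firing loop just described can be sketched as follows in Python. To keep the example short it makes several assumptions not taken from the report: statements are ground tuples (no asterisk variables), an episode is a set of statements, and a rule has a required part, a forbidden part, and a part it adds. The `Rule` class and `grow` function are hypothetical names.

```python
from collections import deque

class Rule:
    def __init__(self, name, if_all, if_none, then_add):
        self.name = name
        self.if_all = frozenset(if_all)      # must all hold in the episode
        self.if_none = frozenset(if_none)    # must not hold in the episode
        self.then_add = frozenset(then_add)  # added to form the new episode

    def fire(self, episode):
        """Return the new episode if the IF-side matches, else None."""
        if self.if_all <= episode and not (self.if_none & episode):
            return episode | self.then_add
        return None

def grow(initial, rules, max_episodes=50):
    """Build an envisionment as a list of (parent, rule name, child) arcs."""
    start = frozenset(initial)
    arcs, seen, queue = [], {start}, deque([start])
    while queue and len(seen) < max_episodes:
        episode = queue.popleft()
        for rule in rules:
            child = rule.fire(episode)
            if child is None or child == episode:
                continue
            arcs.append((episode, rule.name, child))
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return arcs

abuts = ("abuts", "b-cell.01", "antigen.04")
binds = ("binds", "b-cell.01", "antigen.04")
rules = [Rule("antigen.binding.b", [abuts], [binds], [binds])]
arcs = grow([abuts], rules)
```

With the single binding rule, the initial episode yields one new episode in which the B-cell binds the antigen, and the rule's "not already bound" condition then prevents it from firing again. Several rules matching the same episode would produce the alternate branches discussed earlier.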

2.3  Encyclopedia  Behaviors 

TSC  is  designed  for  interactive,  cooperative  exploration  of  some  environment  with  a  user.  Two 
aspects of the TSC system enable interaction with the user: 1) the program will read knowledge
supplied by the user in a frame format or a “natural language” format, both from a text file, and
2) the program will accept knowledge supplied by the user in the “natural language” format
when the user types the new knowledge in TSC’s “conversation” window.

The software which parses sentences typed by the user in the conversation window will accept
either  statements  (new  knowledge)  or  questions.  A  user  question  causes  the  conversation  software 
to  attempt  to  generate  an  answer. 

This  conversation  facility  has  been  used  to  generate  two  knowledge  bases,  but  has  not  been  used 
in  the  02  activity. 

2.4  Exploratory  Behaviors 

2.4.1  Case-Based  Reasoning 

2.4.1.1 CBR on TSC

Case-based  reasoning  (CBR)  is  an  outgrowth  of  artificial  intelligence  research.  The  approach 
enables  machine  learning  of  some  environment  by  storing  and  indexing  the  experience  provided 
in  training  exercises.  This  indexing  builds  “cases"  by  which  the  program  may,  during  some  later 
exercise,  notice  similarities  between  the  current  experience,  and  prior  cases.  Mapping  a  prior 
case  to  the  current  situation  involves  reasoning  by  analogy. 

We  have  implemented  a  version  of  CBR  in  which  all  of  the  Brookhaven  Database  proteins  serve 
as  cases  from  which  a  new  protein  design  may  be  generated.  We  now  contrast  our  approach  to 
conventional CBR techniques.

2.4.1.2  TSC  Case-Based  Design  Approach  vs.  Conventional  AI  CBR 

While the “dialect” of case-based reasoning (CBR) used to produce the proteins has similarities
to traditional AI-based CBR [Riesbeck and Schank, 1989], it also has some important differences.
The comparisons are presented below. The traditional CBR approach is based on the needs of
powerful reasoning systems used in story understanding, and creative activities such as design
and authorship. The TSC approach is based rather strictly on the needs of molecule design. Thus,
the TSC code may be considered a specialization of the traditional CBR approach. Traditional AI
CBR is illustrated in figure 2-6 (after [Riesbeck and Schank, 1989]).

Figure 2-6: Traditional Case-based Reasoning Algorithm.


2.4.1.2.1 Similarities

•  Case  library  used  as  knowledge  base. 

The majority of the knowledge of the protein design system is contained within the case library.
The  system  doesn’t  know  why  a  certain  sequence  of  amino  acids  forms  a  helix,  it  just  knows  that 
it  does.  It  has  a  set  of  examples  of  helices  without  ever  knowing  what  a  helix  is,  or  how  to  create 
one  from  scratch. 

•  Solving problems includes searching a case library for examples.

When  the  system  is  asked  to  design  a  protein  containing  certain  structures,  it  searches  the  case 
library  for  similar  structures.  This  is  analogous  to  searching  a  traditional  case  library  for  similar 
plans, or recipes, or events, etc. The TSC case-based design system doesn’t need to know how to
create  a  helix  from  scratch,  because  it’s  seen  a  helix  before  (in  its  case  library),  and  knows 
something  about  what  they  are  composed  of. 

2.4.1.2.2 Differences

•  No adaptation of cases: a case is always used exactly as is.

A traditional CBR system will find a similar case and then adapt it to fit the current goal of the
system.  The  TSC  case-based  design  system  will  find  a  similar  case  and  then  plug  it  in  to  the 
solution  without  altering  the  original  case  at  all. 

•  New  proteins  aren’t  stored  as  cases,  because  they  aren’t  cases. 

A  traditional  CBR  system  will  generate  new  cases  as  it  goes  about  its  process,  and  will  typically 
store  them  for  later  use.  TSC  presently  does  not  generate  a  new  case;  it  merely  combines  old 
cases  in  a  new  order  to  form  a  new  protein. 

•  No sure near-term way to determine effectiveness of the algorithm.

Because  the  testing  of  the  designs  using  a  structure  prediction  algorithm  is  limited  in  its  accuracy 
to  the  50  to  70  percent  range,  there  is  no  certain  near-term  way  to  determine  if  a  design  was 
successful  or  not.  Therefore,  there  can  be  no  immediate  adaptation  or  learning  from  the  system’s 
failures.  If  TSC  generates  an  incorrect  protein  given  a  certain  case  library  and  specification,  it 
will  generate  that  same  incorrect  protein  every  time. 
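
The differences listed above can be seen in a minimal Python sketch of case-based design without adaptation. Everything here is hypothetical: the library entries, the structure labels, and the `design` function are invented to show retrieval followed by unmodified reuse, and do not reproduce the TSC code or the Brookhaven data.

```python
def design(spec, library):
    """Concatenate stored fragments, unmodified, in the order given by `spec`.

    spec:    list of structure labels the new protein should contain
    library: list of (structure label, amino acid sequence) cases
    """
    parts = []
    for structure in spec:
        matches = [seq for label, seq in library if label == structure]
        if not matches:
            raise LookupError(f"no case for structure {structure!r}")
        parts.append(matches[0])  # first match, used exactly as stored
    return "".join(parts)

# Hypothetical case library.
library = [
    ("helix", "AEELLKK"),
    ("sheet", "VTVTV"),
    ("turn", "GNGS"),
    ("helix", "EELLKKA"),
]
protein = design(["helix", "turn", "sheet"], library)
```

Because retrieval is deterministic and nothing is adapted or stored back into the library, the same library and specification always give the same result, which is the behavior noted in the last bullet above.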

2.4.2  Hypothesis  Formation 

In the following example, we show, drawing from the biomedical domain, that we can ask TSC
to investigate the prevention of some outcome. TSC explores candidate approaches by forming
conjectures  on  ways  to  prevent  the  outcome,  and  then  presents  those  conjectures  to  the  user. 


Figure  2-7  illustrates  an  immune  system  envisionment  in  which  some  antigen  (disease  causing 
agent)  is  introduced  into  the  organism.  We  ask  of  TSC’s  exploratory  behaviors  how  to  prevent  a 
particular  outcome — immune  system  failure.  The  exploratory  behaviors  derive  one  or  more 
conjectures  in  the  form  of  proposed  alternate  branches  of  the  envisionment  tree  structure,  as 
illustrated  in  figure  2-8. 




TSC's  hypotheses  may  include  several  different  approaches  to  prevent  the  final  outcome.  For  the 
immune  example,  prophylaxis  (prevention  of  antigen  introduction)  may  be  proposed.  Another 
branch  may  propose  an  “antibiotic”  to  inhibit  the  action  of  the  antigen.  These  hypotheses  are 
generated  by  TSC  in  a  variety  of  ways.  The  first  is  to  examine  a  database  of  “cases”  which  may 
be similar to the case represented by the current envisionment. There, the prophylaxis hypothesis
may be found and applied as an alternate branch to the current envisionment. This is an example
of case-based reasoning as described above.

TSC may also search its collection of process rules for a process which might, by
analogy, be mapped to the existing situation. The classic example of this (drawn from historical
discussions, not from TSC's own experience) is the mapping of a battle strategy (divide the
forces up and attack from different directions) to the problem of reducing or eliminating radiation
burns when radiation therapy is indicated in tumor therapy.

2.4.3  Design 

We have developed two different approaches to design, both of which are discussed elsewhere in
this report. The first approach is our application of case-based reasoning to design a new protein
by  analogy  to  a  case  library  of  known  proteins. 

The second approach applies design rules to a “design envisionment” in which the initial conditions
are  a  “seed”  design,  and  design  rules  build  an  envisionment,  always  trying  to  improve  the  design. 
This  approach  requires  that  we  supply  TSC  with  a  simulator  which  is  capable  of  evaluating  each 
new  design. 

2.4.4  Directed  Evolution 

Directed evolution (DE) is our name for a computational approach to discovery pioneered by
Douglas Lenat in his AM and Eurisko programs [Lenat, 1983]; the term is borrowed from
molecular biology [Abelson, 1990]. The approach involves performing mutations to elements of
a knowledge base and examining the results.

2.4.4.1 Directed Evolution vs. Genetic Algorithms

One  variant  of  directed  evolution  is  based  on  “heuristically  guided”  machine  learning.  The  other 
variant  is  based  on  random  mutation.  We  have  begun  to  develop  both  approaches.  In  the  random 
mutation  variant,  TSC  applies  a  machine  learning  technique  adapted  from  the  genetic  algorithm 
(GA) [Goldberg, 1989]. Recent work has applied the GA to computational chemistry and
chemometrics [Lucasius and Kateman, 1989]. Our system extends the GA to function in the traditional
symbolic  environments  of  AI. 

John H. Holland seeded the creation of the GA and wrote the seminal works on the subject (cf.
[Holland, 1992], [Holland et al., 1986], [Holland, 1986], [Holland, 1992], [Goldberg, 1989],
[Farmer, Packard, & Perelson, 1986], [Judson & Rabitz, 1992], and [Koza, 1992]). Holland
and his work have been honored by the 1992 MacArthur Prize. His students have evolved the
algorithm  to  its  present  level  of  power  and  generality.  The  DE  algorithm  is,  roughly  speaking,  a 
slight  generalization  of  the  GA. 

The  GA  serves  as  a  guided  optimization  system  in  that  over  a  number  of  “generations,”  it  selects 
elements and combines them into predictive rules similar to OBS.1 shown in section 2.1.4. Rules are
then  rated  for  their  predictive  accuracy  during  the  learning  exercise  and  successful  outcomes  are 
reinforced.  Overall  performance  improves  over  time,  since  successful  elements  are  allowed  to 
survive  through  subsequent  generations. 

The  select-combine  algorithm  mimics  evolutionary  processes  such  as  crossover,  point  mutation, 
viral  infection,  and  so  forth.  For  example,  a  pair  of  “strong”  rules  (i.e.  good  predictors)  may  be 
selected  as  “parents”  in  a  crossover  breeding  exercise.  Actors  or  relations  will  be  traded  between 
them such that “child” rules are constructed from parts of each parent.
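
A crossover of this kind can be sketched in Python as below. The representation is an assumption: each rule is reduced to a list of condition tuples, and a single cut point is chosen in each parent. The condition names and the `crossover` function are hypothetical.

```python
import random

def crossover(parent_a, parent_b, rng):
    """One-point crossover on two rules given as lists of conditions."""
    cut_a = rng.randint(0, len(parent_a))   # cut position in parent A
    cut_b = rng.randint(0, len(parent_b))   # cut position in parent B
    child_1 = parent_a[:cut_a] + parent_b[cut_b:]
    child_2 = parent_b[:cut_b] + parent_a[cut_a:]
    return child_1, child_2

# Hypothetical parent rules: actors and relations as condition tuples.
parent_a = [("actor", "ALA"), ("actor", "GLU"), ("abuts", "ALA", "GLU")]
parent_b = [("actor", "LEU"), ("actor", "LYS"), ("abuts", "LEU", "LYS")]
child_1, child_2 = crossover(parent_a, parent_b, random.Random(1))
```

Whatever cut points are drawn, the two children together contain exactly the conditions of the two parents; only their grouping into rules changes.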

Directed  evolution,  like  a  genetic  algorithm,  is  applied  to  a  population  of  rules  to  evolve  a  more 
successful  population.  Success  is  defined  externally,  by  way  of  a  goal  to  reach.  In  contrast  to 
Darwinian  evolution,  directed  evolution  and  genetic  algorithms  employ  goals  such  as  finding  a 
class  of  objects,  e.g.,  rules,  as  applied  to  some  task  or  problem.  Evolution  in  the  Darwinian  sense 
has  no  such  goal. 

2.4.4.2 Algorithm

In  applying  the  GA  to  any  proposed  activity,  one  maps  the  actors,  relations,  and  states  of  the 
domain  into  classes  or  “gene  pools.”  In  a  biochemistry  study,  actors  might  include  atoms, 
substructures  (molecule  fragments),  and  entire  molecules.  Relations  might  include  types  of  bonds 
and  various  structural  features.  Properties  such  as  hydrophobicity  may  also  be  included.  Thus, 
candidate  descriptors  covering  topological,  geometric,  electronic,  and  physicochemical  properties, 
and  mode  of  toxic  action  are  available  for  selection. 

Figure  2-9  illustrates  the  program  flow  when  running  the  GA.  Given  a  database  and  an  initial  set 
of  rules  (typically  generated  at  random),  the  system  exercises  rules  on  data,  looping  until  all  rules 
have  been  tried  on  all  data.  Following  this,  all  rules  are  evaluated  for  their  predictive  capabilities: 
selection  is  made  of  successful  candidate  rules  to  be  “parents”  in  a  breeding  exercise.  By 
applying  crossover  and  a  variety  of  point  mutations  on  individuals,  a  new  body  of  rules  is  created 
which  must  “compete”  with  the  prior  body  of  rules.  Rules  gain  or  lose  worth  based  on  prediction 
accuracy following exercises with the data. Rules of higher worth have a higher probability of
becoming  parents  in  future  trials. 
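
The loop described above can be sketched in Python as follows. This is a toy under stated assumptions, not the TSC implementation: a rule is reduced to a pair of residues that predicts a helix when the first abuts the second, worth is the count of correct predictions, the residue alphabet and data are hypothetical, and the mutation rate of 0.2 is arbitrary.

```python
import random

RESIDUES = "AEGKLV"  # hypothetical reduced alphabet

def predicts_helix(rule, sequence):
    """A rule (first, second) predicts a helix when `first` abuts `second`."""
    return rule[0] + rule[1] in sequence

def worth(rule, data):
    """Number of examples on which the rule's prediction is correct."""
    return sum(predicts_helix(rule, seq) == is_helix for seq, is_helix in data)

def mutate(rule, rng):
    """Point mutation: replace one position with a random residue."""
    changed = list(rule)
    changed[rng.randrange(2)] = rng.choice(RESIDUES)
    return tuple(changed)

def evolve(data, generations=30, size=12, seed=0):
    rng = random.Random(seed)
    population = [(rng.choice(RESIDUES), rng.choice(RESIDUES)) for _ in range(size)]
    for _ in range(generations):
        # Rules of higher worth are more likely to be chosen as parents.
        weights = [worth(rule, data) + 1 for rule in population]
        children = []
        for _ in range(size):
            a, b = rng.choices(population, weights=weights, k=2)
            child = (a[0], b[1])                 # crossover of two parents
            if rng.random() < 0.2:
                child = mutate(child, rng)
            children.append(child)
        # New rules compete with the prior body of rules; the best survive.
        pool = list(dict.fromkeys(population + children))
        pool.sort(key=lambda rule: worth(rule, data), reverse=True)
        population = pool[:size]
    return population[0]

# Hypothetical training data: (sequence, is it a helix?).
data = [("AEELLKK", True), ("KAEAAK", True), ("VTVTVG", False),
        ("GVGVGL", False), ("LAELKA", True)]
best = evolve(data)
```

Because the prior rules remain in the pool, the worth of the best surviving rule never decreases from one generation to the next in this sketch.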

This  process  repeats  for  many  trials.  A  learning  curve  results  from  the  accumulated  experience  of 
many  cycles  of  GA  evolution.  Directed  evolution  continually  seeks  to  improve  the  performance 
of the predictive rules. In most cases, more cycles of the learning system result in better
predictive performance of the resulting rules.

Figure 2-9: Flow of Activity in Directed Evolution.

Our  “Exercise  Directed  Evolution”  block  in  figure  2-9  applies  heuristic  guidance  to  the  exercise 
of  the  genetic  algorithm.  Here,  we  are  interested  in  “directing”  the  GA  toward  rapid  improvement 
in its discovery of good predictive rules. A periodic evaluation of the various combining/mutating
functions of the GA is conducted by TSC, and heuristic rules in the knowledge base direct
changes to the probability of occurrence of each GA function. Heuristic rules offer “suggestions”;
the firing of a given heuristic rule may not produce the anticipated improvement in
GA performance. The flow chart of figure 2-9 illustrates the cyclic nature of directed evolution:
it  is  a  search  for  improvements  in  toxicity  prediction.  By  guiding  the  evolution,  TSC  eventually 
improves  the  performance  of  the  GA  during  learning  cycles.  This  heuristic  discovery  system  is 
patterned  after  the  Eurisko  program  of  [Lenat,  1983]  and  the  Hypgene  program  of  [Karp,  1989]. 

Starting  with  a  Darwinian  Evolution  approach,  implemented  as  a  genetic  algorithm,  the  system 
follows  the  guidance  of  a  knowledge  base.  This  coupling  of  heuristic  guidance  to  a  GA  creates 
learning  curves  that  achieve  desired  results  in  a  reasonable  amount  of  time.  We  present  an 
example  of  this  coupling  later  in  the  Mutation  section. 
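
As a rough illustration of how a heuristic might change the probability of occurrence of each GA function, the sketch below shifts probability toward operators whose recent offspring did better than average. The function name `adjust_operator_probs`, the success statistic, and the `step` and `floor` parameters are all assumptions; the report does not give the form of TSC's heuristic rules.

```python
def adjust_operator_probs(probs, success, step=0.05, floor=0.01):
    """Shift probability toward operators with above-average recent success.

    probs:   {operator_name: probability}, summing to 1
    success: {operator_name: fraction of recent offspring that beat their parents}
    """
    mean = sum(success.values()) / len(success)
    raised = {
        op: max(floor, p + step * (success[op] - mean))  # floor keeps every operator alive
        for op, p in probs.items()
    }
    total = sum(raised.values())
    return {op: p / total for op, p in raised.items()}  # renormalize to sum to 1

probs = {"crossover": 0.9, "mutation": 0.1}
success = {"crossover": 0.2, "mutation": 0.6}
new_probs = adjust_operator_probs(probs, success)
```

Because the adjustment is only a suggestion, a later evaluation can move the probabilities back if the change does not pay off.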

2.4.4.3 Optimization Strategy

Because  GAs  are  a  family  of  iterative  search  algorithms,  and  therefore  comparable  to  both  linear 
and  non-linear  optimization  techniques,  it  is  important  to  understand  what  distinguishes  GAs 
from  conventional  systems. 

A  universal  problem  associated  with  optimization  is  that,  when  applied,  the  methods  are  typically 
over-constrained  by  the  numerous  assumptions  made  to  transform  a  dynamic  real-world  problem 
into  a  mathematical  formalization.  In  general,  optimization  techniques  have  three  difficulties:  1) 
depending  on  their  search  strategy  they  are  sensitive  to  large  or  erratic  noise  in  the  data,  2)  they 
are  hampered  by  local  performance  peaks  that  may  be  unrelated  to  the  overall  maximum,  and  3) 
their  search  strategy  uses  the  slope  of  the  function  to  select  the  next  step  in  the  search  process. 
For  the  more  complex  problems,  the  local  slope  does  not  provide  adequate  information  about  the 
location  of  the  maximum.  This  is  particularly  the  case  for  nonlinear  problems. 

These  difficulties  vary  and  depend  both  on  the  problem  and  optimization  method.  The  GA  would 
be  subject  to  these  characteristics,  but  properties  of  the  genetic  algorithm  mitigate  them.  For 
example,  local  peaks  are  escaped  by  the  mutation  operator  (discussed  below).  As  a  consequence, 
a GA is more analogous to complete enumeration than to any of the math model-based optimization
techniques,  i.e.,  a  GA  is  largely  a  trial  and  error  process  involving  multiple  candidate  solutions 
instead  of  a  slope-guided  process  involving  a  single  candidate  solution.  But  in  contrast  to  complete 
enumeration,  the  multiple  candidates  are  only  a  small  subset  of  the  total  number  of  solutions  and 
they  are  evaluated  in  parallel  (as  a  set  of  solutions). 

A  GA  employs  and  combines  qualitative  and  quantitative  operators  encoded  as  conditions  in  the 
search  for  a  qualitative  (i.e.,  find  an  instance  of,  find  a  class  of,  or  find  all)  or  quantitative  goal 
(i.e.,  to  maximize  or  minimize  some  numeric  value).  In  comparison  with  other  conventional 
optimization  techniques,  a  GA  has  several  advantages:  1)  GAs  encode  the  parameters  which  they 
have  to  optimize  and  base  their  procedure  on  the  codes — not  on  the  parameters  themselves,  2) 
GAs  work  in  parallel  on  a  number  of  search  points  (potential  solutions)  and  not  on  a  unique 
solution,  which  means  that  the  search  method  is  not  local  in  scope  but  rather  looks  globally  at  the 
search  space,  3)  GAs  require  from  the  environment  only  an  objective  function  measuring  the 
fitness  score  of  a  candidate  solution,  and  4)  both  selection  and  recombination  steps  (discussed 
below) are performed by using probabilistic rules rather than deterministic ones [Renders &
Norvik, 1992].

As  per  their  biological  origins,  GAs  imply  the  use  of  mutation  as  a  fundamental  mechanism  of 
innovative  population  variation,  but  instead  of  the  usual  genetic  material,  i.e.,  DNA  in  biology, 
problems  encoded  in  the  form  of  IF-THEN  rules  are  addressed.  In  addition  to  mutation,  GAs 
typically  rely  on  two  additional  operators  called  reproduction  and  crossover  for  population  variation. 
The  system  discussed  here  does  not  employ  reproduction,  but  limits  population  variation  to 
crossover  and  mutation.  Components  of  “if-then”  rules  (i.e.,  antecedents  and  consequents)  serve 
as the genome (biological domain), and the rules themselves serve as the phenotypes. Thus, a
collection of rules serves as a “gene pool” for crossover and mutation operators. We discuss those
two operators next.

2.4.4.4  Crossover 

Crossover  is  regarded  in  the  literature  as  the  dominant  operator  when  compared  to  mutation. 
Using  crossover,  two  rules  are  selected  to  produce  “offspring”  by  exchanging  a  portion  of  their 
rules:  IF  (antecedent)  subjects,  objects  and  relations;  and/or  THEN  (consequent)  subjects,  objects 
and relations (analogous to gene splicing). The offspring replace weaker rules in the population.
Crossover serves two complementary functions. First, it provides new points for further testing
within the existing problem “subspaces” (represented by the parent rules). Second, it introduces
representative members of “subspaces” not already existing through prior crossover. The DE
variant of crossover alternates the method of selecting parents between random selection
and selection on the basis of strength (crossing those rules with the highest predictive accuracy).
Given one pair of parents, two children are produced by the process. Genetic material that
comprises the antecedents (IF clauses of the rules) is spliced and exchanged to make two new
children. The new rules are placed in the voting “pool” of rules.
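
A minimal sketch of this crossover, assuming each rule is a dictionary with an `"if"` list of antecedent conditions, a `"then"` consequent, and a `"worth"`. The names `pick_parents` and `crossover`, the even/odd alternation, and the single cut point per parent are illustrative choices, not TSC's documented implementation.

```python
import random

def pick_parents(rules, trial, rng=random):
    """Alternate between random parents and the two strongest rules."""
    if trial % 2 == 0:
        return rng.sample(rules, 2)
    return sorted(rules, key=lambda r: r["worth"], reverse=True)[:2]

def crossover(parent_a, parent_b, rng=random):
    """Splice the antecedent lists of two rules and return two children."""
    cut_a = rng.randint(0, len(parent_a["if"]))
    cut_b = rng.randint(0, len(parent_b["if"]))
    child_1 = {"if": parent_a["if"][:cut_a] + parent_b["if"][cut_b:],
               "then": parent_a["then"]}
    child_2 = {"if": parent_b["if"][:cut_b] + parent_a["if"][cut_a:],
               "then": parent_b["then"]}
    return child_1, child_2
```

Every antecedent of the parents ends up in exactly one child. A child can receive an empty antecedent list when the cuts fall at the ends, so a real system would probably discard or repair such offspring.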

2.4.4.5  Mutation 

Mutation  is  a  secondary  operator  in  directed  evolution,  and  is  applied  with  very  low  probability 
of  occurrence,  typically  less  than  a  few  percent  of  the  time.  Its  purpose  is  to  alter  the  encoded 
value  of  a  random  position  (point)  on  a  string.  Examples  of  point  mutation  are  insertion,  deletion, 
or  change  of  some  rule  component.  In  the  TSC  DE  system  a  selected  rule  is  copied  and  the 
mutation operator, selected at random, is applied to that copy. The source of a “mutant” DNA
element for insertion or change is typically a random member of the rule set. In a “viral”
mutation, a DNA element is selected at random from a source pool outside the rule set.
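
The copy-then-mutate step might look like the sketch below. It assumes rules are dictionaries with an `"if"` list of conditions and that every rule in the rule set has at least one condition; the function name `mutate` and the `viral_pool` argument are hypothetical.

```python
import copy
import random

def mutate(rule, rule_set, viral_pool=None, rng=random):
    """Copy a rule and apply one randomly chosen point mutation to the copy."""
    mutant = copy.deepcopy(rule)
    conditions = mutant["if"]
    # Donor element: a member of the rule set, or an outside pool for a "viral" mutation.
    if viral_pool:
        donor = rng.choice(viral_pool)
    else:
        donor = rng.choice(rng.choice(rule_set)["if"])
    op = rng.choice(["insert", "delete", "change"])
    if op == "delete" and len(conditions) > 1:
        del conditions[rng.randrange(len(conditions))]
    elif op == "insert" or not conditions:
        conditions.insert(rng.randint(0, len(conditions)), donor)
    else:
        # "change", or a "delete" that would have emptied the rule
        conditions[rng.randrange(len(conditions))] = donor
    return mutant
```

The original rule is left untouched, so the mutant can compete with its parent in later trials.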

Mutations may be guided by individual heuristic IF-THEN rules, or they may take the more
Darwinian  flavor  of  random  changes.  A  simple  example  of  a  heuristically-guided  mutation  is 
drawn  from  our  work  in  design;  a  design  rule  which  has  contributed  successfully  to  the  design  of 
a  vehicle  is  selected  for  mutation.  It  is  then  cloned  and  mutated  and  the  results  are  then  studied. 

A stylized initial rule is:

If you want to improve the performance of the vehicle
And the vehicle includes an aerodynamic structure (e.g. a wing)
Then consider increasing span by 10%

The mutated version of the rule is:

If you want to improve the performance of the vehicle
And the vehicle includes an aerodynamic structure (e.g. a wing)
Then consider increasing wingspan by 20%

This relatively simple mutation increased the rate at which the rule applies its “mutations”
to the evolving design. This mutation was suggested by a rule of the form:

If you have a design rule which is more than X% stronger than other rules
Then consider cloning the rule
And mutate the new clone by increasing/decreasing its rate parameter
And post a task to study the new rule’s performance

Directed  evolution  involves  a  hypercycle,  that  is,  the  design  exploration  process  is  a  cyclic 
evolution  acting  on  the  product  being  designed,  while  the  tools  of  evolution  are,  themselves, 
subject  to  the  forces  of  evolution. 
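
The heuristic above can be sketched as a function that scans the rule population, clones unusually strong rules, and returns the clones as tasks to study. Reading "X% stronger than other rules" as "X% above the mean worth of the other rules" is an assumption, as are the names `suggest_clones`, `margin_pct`, `factor`, and the `rate_pct` field.

```python
import copy

def suggest_clones(rules, margin_pct=25.0, factor=2.0):
    """Clone each rule whose worth exceeds the mean of the others by margin_pct,
    scale the clone's rate parameter, and return the clones for later study."""
    tasks = []
    for rule in rules:
        others = [r["worth"] for r in rules if r is not rule]
        if not others:
            continue
        baseline = sum(others) / len(others)
        if rule["worth"] > baseline * (1 + margin_pct / 100):
            clone = copy.deepcopy(rule)
            clone["rate_pct"] = rule["rate_pct"] * factor
            clone["worth"] = baseline  # assumption: a clone starts at average worth
            tasks.append(clone)
    return tasks

design_rules = [
    {"name": "increase-span", "worth": 900, "rate_pct": 10},
    {"name": "reduce-mass", "worth": 400, "rate_pct": 5},
    {"name": "add-fairing", "worth": 500, "rate_pct": 5},
]
clones = suggest_clones(design_rules)
```

With these hypothetical values only the span rule qualifies, and its clone carries a 20% rate, mirroring the stylized example.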

2.4.5 Genetic Programming

In  an  approach  to  directed  evolution  different  from  the  rule-building  discussed  above,  we  evolve 
lisp-like  programs  which  perform  some  task.  Our  programs  are  intended  to  predict  protein 
structure,  as  discussed  in  the  next  chapter.  Here,  we  discuss  this  approach  to  genetic  programming. 
Our implementation of genetic programming is based on Koza [1992]. Genetic programming is
used  to  modify  a  population  of  programs  which  perform  tests  on  the  data.  Programs  in  this  study 
are  constructed  of  boolean  expressions.  The  terminal  values  of  these  expressions  are  generated 
by  application-specific  primitive  functions. 

2.4.5.1 Algorithm

Create an initial (random) population of programs.

1. Execute and evaluate programs to determine fitness.

2. Rank order programs according to fitness.

3. Generate a new population of programs by applying reproduction, crossover, and mutation to the
best of the old programs.

Go to step 1.

Normally  this  algorithm  chooses  the  best  program  to  appear  in  any  generation.  In  our  version  the 
result is the final population of programs with non-zero fitness. These programs will be used in
concert  for  recall  and  prediction. 

The  process  begins  by  applying  a  population  of  randomly-generated  programs  to  the  elements  of 
a  database.  Program  results  are  placed  in  a  matrix  and  evaluated  to  obtain  a  measure  of  each 
program’s  fitness.  These  fitness  values  are  used  in  selecting  programs  to  breed  into  the  next 
generation.  This  process  is  illustrated  in  figure  2-10. 

Figure 2-10: Genetic Programming.

Programs  are  constructed  of  boolean  operators  connecting  application-specific  primitives.  Our 
implementation  constructs  programs  using  Scheme,  a  dialect  of  Lisp.  This  representation  is  easy 
to manipulate and directly executable in the Scheme environment.

An evaluation matrix is created by applying each program to each point in the training data. For
each data point (i), and for each program (j), the program is executed and the result is stored at
location (i,j) in the matrix. Thus each column of the matrix contains the values returned by a
particular program for all the data points, and each row contains the values returned by all
programs for a particular data point.
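
The construction of the evaluation matrix can be sketched as follows. Python callables stand in for the Scheme programs, and the iris-like measurements and the two boolean programs are invented for illustration.

```python
def evaluation_matrix(data_points, programs):
    """Row i holds the results of every program on data point i;
    column j holds the results of program j on every data point."""
    return [[program(point) for program in programs] for point in data_points]

# Hypothetical data points: (petal length, petal width).
data = [(1.4, 0.2), (4.7, 1.4), (6.0, 2.5)]

# Hypothetical boolean programs built from comparison primitives.
programs = [
    lambda p: p[0] < 2.5,
    lambda p: p[0] > 5.0 or p[1] > 2.0,
]

matrix = evaluation_matrix(data, programs)
column_0 = [row[0] for row in matrix]  # everything program 0 returned
```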

The  evaluation  matrix  is  used  in  evaluating  the  fitness  of  programs  using  rough  set  theory.  We 
discuss rough set technology elsewhere in this report.

2.4.5.2 Survival and Reproduction

Because  our  implementation  is  searching  for  a  good  population  of  programs  rather  than  a  good 
individual  program,  it  is  convenient  to  separate  the  concepts  of  survival  and  reproduction. 

2.4.5.2.1 Survival

We  have  experimented  with  three  options  for  survival  of  programs  from  one  generation  to  the 
next. 

1. Retain a fixed number of the best programs.

2. Calculate the minimum SGF (significance) for all programs and retain all programs with an SGF
greater than the minimum.

3. Retain M, i.e., programs with SGF > 0. Note that M is a minimal set, so it will by definition contain a
diverse population.
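
The three survival options can be compared side by side in a short sketch. It assumes the significance values have already been computed and are supplied as a dictionary; the function name `survivors` and the `keep` parameter are hypothetical.

```python
def survivors(sgf, option, keep=2):
    """Return the names of programs that survive; sgf maps name -> significance."""
    ranked = sorted(sgf, key=sgf.get, reverse=True)
    if option == 1:                      # fixed number of the best programs
        return ranked[:keep]
    if option == 2:                      # everything above the minimum SGF
        lowest = min(sgf.values())
        return [p for p in ranked if sgf[p] > lowest]
    return [p for p in ranked if sgf[p] > 0]  # option 3: the minimal set M

sgf_values = {"p1": 0.5, "p2": 0.1, "p3": 0.25}
```

Options 2 and 3 let the number of survivors vary with the problem, whereas option 1 fixes it in advance.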

2.4.5.2.2 Program Reproduction

We  have  experimented  with  two  options  for  production  of  new  programs  from  one  generation  to 
the next.

1. Replace discarded programs.

2.  Supplement  M  by  a  constant  number  of  programs.  This  allows  P  to  increase  or  decrease 
according  to  the  size  of  the  minimal  set  M.  Population  size  adapts  to  the  problem  and  with 
progress  toward  a  solution. 

New programs are created by the genetic operators, crossover and mutation.

2.4.5.3 Crossover and Mutation

In  this  study  the  genetic  operators  modified  only  the  boolean  expressions.  Crossover  was 
implemented  by  selecting  one  sub-expression  from  each  parent  and  swapping  them  to  produce 
two  new  programs.  Mutation  was  implemented  by  selecting  a  sub-expression  within  a  single 
parent  and  replacing  it  with  a  randomly  generated  expression.  Random  replacement  by  the 
mutation  operator  was  the  only  method  implemented  for  modifying  a  terminal  function. 

2.4.5.3.1 Program Crossover

The  crossover  operator  first  selects  two  members  of  the  current  population  of  programs.  A 
crossover  point  is  selected  within  each  program  and  the  subexpressions  below  these  points  are 
swapped  to  produce  two  new  programs.  Figure  2-11  shows  two  programs  with  subexpressions 
selected. 


Figure 2-11: Subexpressions selected for crossover in parent programs.


Figure  2-12:  New  programs  created  by  crossover. 

2.4.5.3.2 Program Mutation

The  mutation  operator  first  selects  one  member  of  the  current  population  of  programs.  A  mutation 
point is selected within the program and the sub-expression below this point is replaced with a
randomly  generated  expression.  Figure  2-13  shows  an  example  of  an  original  program  and  one 
possible  result  of  applying  the  mutation  operator. 

Figure 2-13: Original and new program created by mutation.
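
Since the figures are not reproduced here, the following sketch shows subexpression crossover and mutation on boolean expression trees. Nested Python lists such as `["and", "t1", ["or", "t2", "t3"]]` stand in for the Scheme expressions, and the terminal names, the operator set, and the depth limit in `random_expr` are assumptions.

```python
import random

def paths(expr, prefix=()):
    """Yield the index path of every sub-expression, the root included."""
    yield prefix
    if isinstance(expr, list):
        for i, arg in enumerate(expr[1:], start=1):  # expr[0] is the operator
            yield from paths(arg, prefix + (i,))

def get(expr, path):
    for i in path:
        expr = expr[i]
    return expr

def replace(expr, path, new):
    """Return a copy of expr with the sub-expression at path replaced by new."""
    if not path:
        return new
    clone = list(expr)
    clone[path[0]] = replace(expr[path[0]], path[1:], new)
    return clone

def crossover(a, b, rng=random):
    """Swap one randomly chosen sub-expression between two programs."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return replace(a, pa, get(b, pb)), replace(b, pb, get(a, pa))

def random_expr(terminals, depth, rng=random):
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(terminals)
    op = rng.choice(["and", "or", "not"])
    arity = 1 if op == "not" else 2
    return [op] + [random_expr(terminals, depth - 1, rng) for _ in range(arity)]

def mutate(a, terminals, rng=random):
    """Replace one randomly chosen sub-expression with a random expression."""
    return replace(a, rng.choice(list(paths(a))), random_expr(terminals, 2, rng))
```

Both operators build new trees and leave the parents unchanged, which matters when parents also survive into the next generation.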

2.5  Data  Evaluation  Tools 

2.5.1 Rough Set Evaluation of Genetic Program Fitness

Fitness  evaluation  in  the  genetic  programming  approach  requires  an  objective  function  with 
which to perform the evaluation. As in the Darwinian algorithm, some critic (typically, the
environment itself in the biological domain) is required to rate the performance of each individual.
In our experimental work, we apply a rough set evaluation tool, based on rough classification
described by Pawlak [1984, 1991] and Ziarko [1989]. These methods allow the fitness of functions
to be evaluated in a context with other functions. We expect this approach to promote population
diversity by a natural tendency to assign lower fitness to redundant programs. To anticipate the
results, our trials of rough set evaluation were run on small tasks such as developing rules
which identify certain cancers or flowers. The method discussed here turns out to be far too
compute-intensive to be appropriate to protein studies on our workstation-based TSC. We do
believe that these tools, when ported to a supercomputer, will provide useful results.

These properties are particularly important for our applications where we are trying to find a
population of programs that perform well in concert. This is in contrast to typical genetic
programming applications where the goal is to find a single program which performs well.

2.5.1.1 Definitions

S = (U, A, V, f) is an information system consisting of a set of data objects (U), a set of attributes (A), a
set of possible attribute values (V), and a function that maps data objects and attributes to
attribute values (f).

U is the set of all data objects. In our irises application it is the set of all iris examples in our database.

A is the set of all attributes. In our implementation it is the set of concepts measured or tested by the
attribute programs. The set A also corresponds to the set of all columns in the evaluation matrix.

V is the domain of attribute values. In our irises application it is {setosa, virginica, versicolor, true,
false}.

f is the description function mapping U × A → V. In our implementation it corresponds to the evaluation
matrix.

N is the set of predicted attributes or columns of the evaluation matrix. In our application it is a single
column containing the species name for the data objects.

P is the set of predictor attributes or columns of the evaluation matrix. These correspond to the set of
programs to be determined.

M is a minimal subset of P which retains the full ability of P to discern elements of N'.

N' is the set of elementary sets or equivalence classes based on predicted attributes. It corresponds
to the set of sets of rows which have matching values in all the N columns.

P' is the set of elementary sets or equivalence classes based on all predictor attributes. It
corresponds to the set of sets of rows which have matching values in all the P columns.

M' is a set of elementary sets or equivalence classes based on a minimal set of predictor attributes. It
corresponds to the set of sets of rows which have matching values in all the M columns.

IND(P)n'i is the union of all elementary sets in P' which are subsets of the ith element, n'i, in N'. These are
the lower approximations of the sets in N'.

POS(P,N) is the union of IND(P)n'i for all elements n'i in N'. This is the subset of U for which P is
sufficient for discerning membership in the equivalence classes of N'.

k(P,N) = card(POS(P,N)) / card(U)

This is the fraction of the set U for which P is sufficient for discerning membership in the
equivalence classes of N'.

SGF(P,N,p) = ( k(P,N) - k(P-{p},N) ) / k(P,N)

This is the significance value; i.e., the relative change in k(P,N) resulting from deletion of p from
P. SGF has the advantage that it rewards programs for their contribution to recall or prediction in
the context of all other programs.

2.5.1.2 Evaluating Attributes

The  sets  and  measures  above  are  used  to  determine  the  significance  of  members  of  P  for 
classifying members of U into the equivalence classes of N'. One goal is to find a minimal subset
of attributes that retains the capability of all P for discerning elements of N'. Another goal is to
assign a significance value to each attribute. This value will be used in calculating program
fitness for the genetic programming process.

The  following  method  calculates  significance  factors  for  each  program  while  determining  a 
minimal  set.  Programs  with  zero  significance  are  not  included  in  the  minimal  set,  M. 

Initialize: M1 = P.

For each pk in P:

    Calculate SGF(Mk,N,pk)

    If SGF(Mk,N,pk) = 0
        then Mk+1 = Mk - {pk}
        else Mk+1 = Mk

There may be several minimal sets. The order of selecting pk during calculation of M will affect
the final contents of M. We have chosen to select the pk beginning with the lowest-SGF
members  of  the  previous  generation.  This  encourages  turnover  in  the  population  by  allowing  an 
older  program  to  be  replaced  by  an  equivalent  set  of  new  programs. 
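
The definitions of k and SGF and the minimal-set procedure can be exercised on a toy evaluation matrix. In this sketch each row is a dictionary of column values; the column names, the four sample rows, and the choice to return an SGF of 0 when k(P,N) is 0 are assumptions made for illustration.

```python
from collections import defaultdict

def k(rows, predictors, target):
    """card(POS(P,N)) / card(U): fraction of rows whose predictor values
    fall in an elementary set that lies inside a single target class."""
    classes = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        key = tuple(row[p] for p in predictors)
        classes[key].add(row[target])
        counts[key] += 1
    positive = sum(counts[key] for key, seen in classes.items() if len(seen) == 1)
    return positive / len(rows)

def sgf(rows, predictors, target, p):
    """Relative drop in k when predictor p is deleted from the set."""
    full = k(rows, predictors, target)
    if full == 0:
        return 0.0  # assumption: nothing to lose when P discerns nothing
    rest = [q for q in predictors if q != p]
    return (full - k(rows, rest, target)) / full

def minimal_set(rows, predictors, target):
    """Visit predictors in the given order, dropping those with zero SGF."""
    m = list(predictors)
    significance = {}
    for p in predictors:
        significance[p] = sgf(rows, m, target, p)
        if significance[p] == 0:
            m.remove(p)
    return m, significance

rows = [
    {"p1": True,  "p2": True,  "p3": False, "species": "setosa"},
    {"p1": True,  "p2": True,  "p3": False, "species": "setosa"},
    {"p1": False, "p2": False, "p3": False, "species": "versicolor"},
    {"p1": False, "p2": False, "p3": True,  "species": "virginica"},
]
m, significance = minimal_set(rows, ["p1", "p2", "p3"], "species")
```

Here p1 and p2 return identical values, so whichever is visited first has zero significance and is dropped. Reversing the order would keep p1 and drop p2, which illustrates why the visiting order affects the contents of M.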

2.5.2 Nearest-Neighbor Pattern Recognition

The technology of pattern recognition with the nearest neighbor technique is primarily applied to
our  protein  structure  prediction  project.  Briefly,  a  large  body  of  exemplars  is  created  based  on  a 
library  of  proteins  and  is  stored  in  computer  memory.  Each  exemplar  represents  a  window  of 
data  which  slides  along  the  amino  acid  sequence  of  any  given  protein.  The  windowed  data  is 
combined,  in  the  exemplar,  with  a  prediction  of  the  structure,  derived  directly  from  the  given 
protein.  This  collection  of  exemplars  is  then  applied  to  the  prediction  of  structure  when  presented 
with  a  window  of  data  from  the  new  protein.  The  exemplar  with  a  window  of  data  nearest  to  the 
window  of  data  from  the  new  protein  "wins"  the  right  to  offer  its  prediction.  This  prediction 
process is repeated for all data windows in the new protein.

The approach used here is to collect the exemplars from a group of proteins which are not
selected  from  the  design  case  library.  In  general,  we  develop  the  algorithm  by  training  on  a  large 
collection  of  proteins,  then  test  the  algorithm  on  a  different  collection  of  known  proteins.  Once 
the  algorithm  has  been  “tuned,”  it  is  next  applied  to  the  testing  of  new  proteins. 

The  TSC  code  for  structure  prediction,  which  is  applied  to  the  designed  protein,  is  based  on  an 
algorithm  documented  in  [Cost  and  Salzberg,  1993],  [Salzberg  and  Cost,  1992].  The  program 
PEBLS  (Parallel  Exemplar-based  Learning  System)  is  reported  to  achieve  accuracies  in  prediction 
of  protein  secondary  structure  as  high  as  71%.  The  algorithm  is  closely  related  to  the  memory-based 
reasoning  approach  of  [Zhang  et  al.,  1992]. 

The  PEBLS  approach  (after  [Salzberg  and  Cost,  1992])  as  implemented  in  this  work  is  as 
follows:  given  a  sequence  of  residues  from  a  fixed  length  window  from  a  protein  chain,  classify 
the  central  residue  in  the  window  as  helix,  sheet,  or  coil.  The  table  below  (after  [Salzberg  and 
Cost, 1992]) compares the percentage of correct predictions and the correlation coefficients
for PEBLS and a variety of other algorithms.

Algorithm                 %correct   Cα      Cβ      Ccoil
PEBLS                     71.0       0.47    0.45    0.40
Zhang et al. 1992         66.4       0.47    0.387   0.429
Qian & Sejnowski 1988     64.3       0.41    0.31    0.41
Holley & Karplus 1989     63.2       0.41    0.32    0.46

The  TSC  code  applies  an  algorithm  patterned  after  PEBLS,  but  does  not  presently  include  the 
weighting  scheme  used  by  PEBLS.  Our  simplified  approach  requires  a  single  pass  through  a 
training  set  of  66  to  91  proteins  selected  for  non-intersection  of  training  proteins  with  the  protein 
set  used  for  case-based  design.  During  this  first  pass,  tables  are  constructed  that  contain  the 
distances  between  amino  acid  values.  As  explained  in  [Salzberg  and  Cost,  1992],  distance  is 
estimated  statistically;  distance  between  amino  acids  is  a  sum  over  three  classes  (helix,  sheet, 
coil).  The  sum  is  based  on  the  number  of  times  residue  1  was  classified  into  a  particular 
category,  the  total  number  of  times  that  residue  occurred,  and  the  number  of  times  residue  2  was 
classified  into  a  particular  category,  and  the  total  number  of  times  that  residue  occurred.  Values 
should  be  similar  if  they  occur  with  the  same  relative  frequency  for  all  classes. 
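
One way to read the distance description above is as a sum, over the three classes, of the absolute difference between the two residues' relative class frequencies. The sketch below builds such a table for a single window position. The function name `distance_table`, the exponent of 1 on the differences, and the sample observations are assumptions; the cited papers remain the authority on the exact PEBLS metric.

```python
from collections import Counter, defaultdict

CLASSES = ("helix", "sheet", "coil")

def distance_table(observations):
    """observations: iterable of (residue, structure_class) pairs seen at one
    window position. Returns {(residue_1, residue_2): distance}."""
    per_class = defaultdict(Counter)
    for residue, cls in observations:
        per_class[residue][cls] += 1
    table = {}
    for r1, c1 in per_class.items():
        n1 = sum(c1.values())  # total occurrences of residue 1
        for r2, c2 in per_class.items():
            n2 = sum(c2.values())  # total occurrences of residue 2
            table[(r1, r2)] = sum(abs(c1[c] / n1 - c2[c] / n2) for c in CLASSES)
    return table

# Hypothetical training observations for one window position.
observations = (
    [("ALA", "helix")] * 3 + [("ALA", "coil")]
    + [("GLU", "helix")] * 3 + [("GLU", "coil")]
    + [("PRO", "coil")] * 4
)
table = distance_table(observations)
```

In this toy data ALA and GLU have the same class frequencies, so their distance is zero even though they are different residues, while ALA and PRO are far apart.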

A  different  table  is  built  for  each  position  in  the  data  window.  Thus,  if  window  length  is 
specified as, say, 17, then 17 tables are constructed. Exemplars are constructed from each
window presented by the training set. Each exemplar is stored in the form: ((window) prediction).
An example follows: ((ALA PRO LYS ...) A) where “A” is the
prediction. If there are, say, 500 windows of residue sequence data in a training set, there will be
500  exemplars  created.  [Cost  and  Salzberg,  1993]  report  that  the  performance  of  their  algorithm 
improves  with  increasing  window  length  to  a  peak  at  a  length  of  19.  The  TSC  code  has  typically 
applied  a  window  length  of  17. 
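
Exemplar construction and nearest-neighbor lookup can be sketched as below. The labels "H" and "C", the seven-residue sequence, the window length of 3, and the simple `mismatch` distance (a stand-in for the per-position distance tables) are all assumptions chosen to keep the example small.

```python
def make_exemplars(residues, labels, window=5):
    """Slide a window along a sequence and pair each window with the
    structure label of its central residue."""
    half = window // 2
    return [
        (tuple(residues[i - half:i + half + 1]), labels[i])
        for i in range(half, len(residues) - half)
    ]

def mismatch(a, b, position):
    """Stand-in distance: 0 if the residues match, else 1."""
    return 0.0 if a == b else 1.0

def predict(window, exemplars, dist=mismatch):
    """Return the prediction stored with the exemplar nearest to the window."""
    def total(exemplar):
        return sum(dist(a, b, pos)
                   for pos, (a, b) in enumerate(zip(window, exemplar[0])))
    return min(exemplars, key=total)[1]

residues = ["ALA", "GLU", "LEU", "LYS", "GLY", "PRO", "SER"]
labels = ["H", "H", "H", "H", "C", "C", "C"]
exemplars = make_exemplars(residues, labels, window=3)
```

A per-position distance table can be supplied through the `dist` argument in place of `mismatch`, using `position` to select the table.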

3.1 Protein Structure Prediction

Our work has focused on predicting protein structure from a given amino acid sequence, and
on designing a protein sequence that will yield a given structure. We use our directed evolution and
genetic  programming  techniques,  along  with  the  nearest  neighbor  analysis.  We  then  use  case-based 
reasoning  to  design  a  protein. 

3.1.1  Methodology 

Consider  the  evolution  of  a  set  of  rules  which  are  successful  at  predicting  the  structure  of  a 
protein  when  given  certain  information  about  that  protein.  The  application  is  quite  similar  to  the 
application  of  a  genetic  algorithm  in  chemometrics  [Lucasius,  1989],  in  which  prediction  of  the 
conformational analysis of DNA molecules was studied. Unlike the approach of Lucasius and Kateman, the gene
information here is much broader than just a few parameters such as the order of the nucleotides in
the  DNA  sequence,  bonding  distances  and  bonding  angles.  The  approach  taken  here  acknowledges 
the  need  for  many  more  parameters  as  suggested  by  Lozano-Perez  in  an  article  by  Erickson 
[1992]  as  follows:  “We’re  finding  you  need  more  like  100  data  points  to  characterize  a  molecule 
properly.”  Attributes  such  as  molecular  charge  and  hydrophobicity  add  dimensionality  that  is 
difficult  for  humans  but  simple  for  computers  to  consider. 

We  apply  two  variants  of  genetic  algorithms  in  our  DE  tools:  (1)  mutations  guided  by  random 
selection,  and  (2)  mutations  guided  by  heuristic  rules.  Both  were  illustrated  above.  We  then 
apply  DE  to  DE  itself.  This  enhancement  uses  heuristics  to  notice  current  performance  levels  of 
predictive rules and alter the breeding and/or mutation methods to allow more successful populations
of  rules  to  “gain  a  foothold”  and  begin  performing.  Once  performance  achieves  a  predetermined 
level,  breeding/mutation  methods  are  again  altered  by  DE  heuristics  to  either  breed  more  generalized 
or  more  specialized  rules.  In  both  cases,  we  are  building  a  set  of  rules  which  perform  prediction. 

As  noted  earlier  concerning  Darwinian  Evolution,  typically  a  GA  evolves  rules  in  a  dynamic 
(continually  changing)  environment — an  environment  Holland  describes  as  a  generator  of 
“perpetual novelty” or concept drift. The TSC DE algorithm is also capable of dealing with
concept drift; however, the protein structure problem is a static environment, one
in  which  the  protein  training  sets  do  not  change  with  each  learning  cycle. 

Directed evolution develops a population of rules intended to predict the presence of helical
structures  in  a  protein  when  given  the  amino  acid  sequence.  Initially,  at  system  startup,  the  DE 
randomly  produces  many  general  rules  for  predicting  a  helix,  attempting  to  fill  every  (candidate 
solution)  niche  in  the  environment.  When  the  environment  is  sufficiently  seeded,  the  DE  begins 
evaluating  those  rules.  Subsequently,  the  randomly  generated  startup  rules  are  bred  based  on  rule 
performance.  A  number  of  search  parameters  can  be  adjusted  by  the  DE  rules;  changes  to  these 
“knobs” may change the size and performance of the gene pool, and may alter the probability of
any  given  GA  strategy. 

A  typical  protein  represented  in  a  TSC-readable  form  looks  like  the  following  frame: 

C: PDB3B5C
name                CYTOCHROME.B5
instance.of         protein
functionality       electron.transport
tertiary.structure  small.ss.rich.or.metal.rich-metal.rich-up.down.ligand.cages
source              bovine.liver
MY.DATA ( SER LYS ALA VAL LYS TYR TYR THR LEU GLU GLU ILE GLN
LYS HIS ASN ASN SER LYS SER THR TRP LEU ILE LEU HIS TYR LYS VAL TYR ASP LEU
THR LYS PHE LEU GLU GLU HIS PRO GLY GLY GLU GLU VAL LEU ARG GLU GLN ALA GLY
GLY ASP ALA THR GLU ASN PHE GLU ASP VAL GLY HIS SER THR ASP ALA ARG GLU LEU
SER LYS THR PHE ILE ILE GLY GLU LEU HIS PRO ASP ASP ARG SER LYS ILE THR LYS
PRO SER GLU SER )

HELIX.POSITION ( ( 9 12 ) ( 33 38 ) ( 44 47 ) ( 53 6e ) ( 65 71 ) ( 81 86 ) )
SHEET.POSITION ( ( 5 7 ) ( 21 25 ) ( 27 32 ) ( 51 5« ) ( 74 8B ) )
TURN.POSITION ( ( 17 21 ) ( 24 28 ) ( 39 42 ) ( 49 52 ) )

3.1.2 Genetic If-Then Rule Generation

The  problem  is  to  predict  the  presence  of  helical  structures  in  a  protein  when  given  its  amino 
acid  sequence  as  illustrated.  One  of  TSC’s  methods  is  to  build  a  population  of  IF-THEN  rules 
when given a “genome.” The genome is built from three “chromosomes,” the working components
of  an  observer  rule.  A  pair  of  observer  rules  follows: 

c: OBS.1
worth         680
if.actors     ( ( ala ( *x ) true ) ( glu ( *y ) true ) )
if.relations  ( ( abuts ( *x *y ) true ) )
then.predict  ( ( helix ( window ) true ) )

c: OBS.2
worth         560
if.actors     ( ( leu ( *x ) true ) ( glu ( *y ) true ) )
if.relations  ( ( abuts ( *x *y ) true ) )
then.predict  ( ( helix ( window ) true ) )

The  three  chromosomes  are: 

• actors
• relations
• predictions

Actors are drawn from the twenty natural amino acids:

Alanine        Arginine       Asparagine     Aspartate
Cysteine       Glutamine      Glutamate      Glycine
Histidine      Isoleucine     Leucine        Lysine
Methionine     Phenylalanine  Proline        Serine
Threonine      Tryptophan     Tyrosine       Valine

Relations  are  primarily  structural  or  spatial  in  this  example: 




abuts      A-B        (abuts (A B) true)
abuts-1    A-X-B      (abuts-1 (A B) true)
abuts-2    A-X-Y-B    (abuts-2 (A B) true)


The two predictions, used as votes by a population of rules, are:

( helix ( window ) true )

( helix ( window ) false )

An example sequence for the protein ACYLTRANSFERASE (after [Gibbs & Leslie, 1990]) looks like
the following:

MET ASN TYR THR LYS PHE ASP VAL LYS ASN
TRP VAL ARG ARG GLU HIS PHE GLU PHE TYR
ARG HIS ARG LEU PRO CYS GLY PHE SER LEU
THR SER LYS ILE ASP ILE THR THR LEU LYS
LYS SER LEU ASP ASP SER ALA TYR LYS PHE
TYR PRO VAL MET ILE TYR LEU ILE ALA GLN
ALA VAL ASN GLN PHE ASP GLU LEU ARG MET
ALA ILE LYS ASP ASP GLU LEU ILE VAL TRP
ASP SER VAL ASP PRO GLN PHE THR VAL PHE
HIS GLN GLU THR GLU THR PHE SER ALA LEU
SER CYS PRO TYR SER SER ASP ILE ASP GLN
PHE MET VAL ASN TYR LEU SER VAL MET GLU
ARG TYR LYS SER ASP THR LYS LEU PHE PRO
GLN GLY VAL THR PRO GLU ASN HIS LEU ASN
ILE ALA ALA LEU PRO TRP VAL ASN PHE ASP
SER PHE ASN LEU ASN VAL ALA ASN PHE THR
ASP TYR PHE ALA PRO ILE ILE THR MET ALA
LYS TYR GLN GLN GLU GLY ASP ARG LEU LEU
LEU PRO LEU SER VAL GLN VAL HIS HIS ALA
VAL CYS ASP GLY PHE HIS VAL ALA ARG PHE
ILE ASN ARG LEU GLN GLU LEU CYS ASN SER
LYS LEU LYS

Observer rules are exercised on segments of a natural system as read from a database; these
segments are called windows of sequence data. During learning, the window is shifted along the
data. An example window of data, with a window size of five amino acids, looks like:

[ ala arg gly ala pro ]

TSC encoders write a body of statements about the window:

(ala (ala.1) true) (arg (arg.1) true) (gly (gly.1) true)

(ala (ala.2) true) (pro (pro.1) true)

(abuts (ala.1 arg.1) true) (abuts-1 (ala.1 gly.1) true)

(abuts-2 (ala.1 ala.2) true) (abuts (arg.1 gly.1) true)
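
The encoding step can be sketched in Python. This is only an illustration under assumptions: the function name `encode_window` and the tuple layout are hypothetical, and the sketch emits every abuts, abuts-1, and abuts-2 statement for the window, whereas the listing above shows only some of them.

```python
from collections import Counter

def encode_window(window):
    """Encode a window of residues as (predicate, arguments, truth) statements.

    Hypothetical sketch: each residue is labeled name.k, where k counts
    occurrences of that residue within the window (ala.1, ala.2, ...).
    """
    counts = Counter()
    labels = []
    statements = []
    for residue in window:
        counts[residue] += 1
        label = f"{residue}.{counts[residue]}"
        labels.append(label)
        statements.append((residue, (label,), True))

    # offset 1 = adjacent (abuts); offset 2 = one residue between (abuts-1);
    # offset 3 = two residues between (abuts-2)
    relation_names = {1: "abuts", 2: "abuts-1", 3: "abuts-2"}
    for i, left in enumerate(labels):
        for offset, name in relation_names.items():
            if i + offset < len(labels):
                statements.append((name, (left, labels[i + offset]), True))
    return statements

statements = encode_window(["ala", "arg", "gly", "ala", "pro"])
```

For the five-residue window above, this produces five actor statements and nine relation statements, including the ones shown in the listing.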


All rules are then exercised (allowed to vote) on this encoded window. This voting is repeated as
the window is “slid” along the entire data set. A reward/punishment algorithm, part of the
directed evolution component of TSC, then examines the performance of the individual rules
which cast a vote. Following the “bucket brigade” algorithm of John Holland [Holland, 1986],
those rules which participate in the vote, and which vote correctly, get a reward (their worth is
increased). Thinking of a given rule and the source (parents) of that rule as a “bloodline,”
additional reward is bestowed upon the source of the successful rules.

Once rewards have been passed to appropriate rules, a small decay (reduction of worth) of all
rules is computed. This has the effect of punishing those rules which do not participate in the
vote, or which vote incorrectly. Rules whose worth falls below a specific value are eliminated.
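
A minimal Python sketch of this reward, decay, and elimination cycle follows. It covers only the simplified scheme described in the text, not Holland's full bucket brigade. The class `Rule`, the function `update_worth`, and every numeric parameter (reward size, parent share, decay rate, elimination floor) are assumptions made for illustration; the report does not give these values.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    name: str
    worth: float
    parents: list = field(default_factory=list)  # names of the source rules

def update_worth(rules, votes, truth, reward=10.0, parent_share=0.5,
                 decay=0.02, floor=50.0):
    """Apply one reward/decay cycle and return the surviving rules.

    rules: dict mapping rule name -> Rule
    votes: dict mapping rule name -> predicted value (only rules that voted)
    truth: the correct answer for the current window
    All numeric defaults are hypothetical.
    """
    # Reward correct voters, and pass a share of the reward to their parents.
    for name, vote in votes.items():
        if vote == truth:
            rules[name].worth += reward
            for parent in rules[name].parents:
                if parent in rules:
                    rules[parent].worth += reward * parent_share
    # Decay every rule; non-voters and incorrect voters lose ground.
    for rule in rules.values():
        rule.worth *= (1.0 - decay)
    # Eliminate rules whose worth has fallen below the floor.
    return {name: rule for name, rule in rules.items() if rule.worth >= floor}

rules = {
    "obs.1": Rule("obs.1", 100.0),
    "obs.2": Rule("obs.2", 100.0),
    "obs.3": Rule("obs.3", 100.0, parents=["obs.1", "obs.2"]),
    "weak": Rule("weak", 50.0),
}
survivors = update_worth(rules, {"obs.3": True, "obs.2": False}, truth=True)
```

In this toy run, obs.3 votes correctly and is rewarded, its parents receive a share, and the rule named weak decays below the floor and is dropped.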

At this point the directed evolution component, with its genetic algorithm, mutates the rule
population and conducts a search for the optimum rule set. For example, using as parents OBS.1
and OBS.2 listed above, “sexual recombination” builds a child that looks like the following:




c: OBS.3

my.source obs.1 obs.2

my.creator crossover.1

worth 200

if.actors ( ( leu ( *x ) true ) ( ala ( *y ) true ) )

if.relations ( ( abuts ( *x *y ) true ) )

then.predict ( ( helix ( window ) true ) )

This “child” rule is added to the population of rules and given a starting worth value. Now,
consider the effect of a point mutation on the rule OBS.3 to make a new rule OBS.4.

c: OBS.4

my.source obs.3

my.creator point.mutate.2

worth 200

if.actors ( ( leu ( *x ) true ) ( ala ( *y ) true ) )

if.relations ( ( abuts ( *x *y ) true ) )

then.predict ( ( helix ( window ) false ) )

This rule is essentially the same rule as its source, OBS.3, except that it votes a different way. If
the rule is successful, it will eventually replace its source in the rule population.
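
The two operators can be sketched in Python as follows. The report does not specify where crossover cuts the parents, so the `crossover` function below is an assumption chosen only because it reproduces the OBS.3 example (first actor of each parent, relation and vote from the first parent). The names `Rule`, `crossover`, and `point_mutate_vote` are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Rule:
    name: str
    actors: tuple       # residues bound to *x and *y, e.g. ("leu", "ala")
    relation: str       # e.g. "abuts"
    predict: bool       # vote: (helix (window) <predict>)
    source: tuple = ()  # names of the parent rules
    creator: str = ""
    worth: int = 200    # starting worth, as in the frames above

def crossover(name, first, second):
    """Hypothetical recombination: *x from the second parent, *y from the first."""
    return Rule(
        name=name,
        actors=(second.actors[0], first.actors[0]),
        relation=first.relation,
        predict=first.predict,
        source=(first.name, second.name),
        creator="crossover.1",
    )

def point_mutate_vote(name, rule):
    """Point mutation that flips only the prediction, as in OBS.4."""
    return replace(rule, name=name, predict=not rule.predict,
                   source=(rule.name,), creator="point.mutate.2", worth=200)

obs1 = Rule("obs.1", ("ala", "glu"), "abuts", True)
obs2 = Rule("obs.2", ("leu", "glu"), "abuts", True)
obs3 = crossover("obs.3", obs1, obs2)
obs4 = point_mutate_vote("obs.4", obs3)
```

Other mutation operators (for example, changing an actor or a relation) would follow the same pattern of copying the source rule and altering one field.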

To  summarize  directed  evolution,  using  a  biological  metaphor,  we  see  that: 

• The strongest rules get to breed

• Successful rules get fed well

• Parents of successful rules get treats

• All rules age

3.1.3 Prediction Program Generation

We turn our discussion from generation of IF-THEN rules capable of predicting the conformation
of a protein from its amino acid sequence, to the generation of Lisp programs which are capable
of the same predictions. In this, we use the genetic algorithm to evolve programs. As it turns out,
the approach we explore is far too compute-intensive for protein studies. Our preliminary
efforts centered, instead, on the development of the approach, which we discuss now. The
approach has sufficient merit that it should be ported to a supercomputer and tested on proteins.

3.1.4  Testing  Recall  and  Prediction 

We have tested our implementation with a relatively small database of iris flowers (Fisher’s iris
data reproduced in [Salzberg, 1990]). Each entry includes the name of the species and four
values for sepal length, sepal width, petal length, and petal width.

We tested for both recall and prediction. The data was partitioned into two disjoint sets for
training and prediction testing. Recall was tested using a subset of the training data. Prediction
was tested using data which was not used in training. Testing was accomplished as follows:

1. Each program in the minimal set, M, is applied to the test data point.

2. The resulting list of values is matched against the corresponding values in each row of the
evaluation matrix. We use an analog of Hamming distance to select a matching row for prediction.
We calculate the distance as the sum of SGF(M,Nmi) for columns m that do not match the
corresponding test value. The row with the minimum distance from the list of test values is
selected. If several rows have the same distance measure, then the first of these is arbitrarily
selected.

3.  The  attribute  values  to  be  recalled  or  predicted  are  retrieved  from  the  N  columns  of  the 
selected  row.  Success  is  measured  as  the  fraction  of  the  test  data  that  is  correctly  recalled  or 
predicted. 
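
The row-selection step can be sketched in Python. This is a hedged illustration: `select_row` is a hypothetical name, and the per-column weights passed in merely stand in for the SGF(M,Nmi) terms, whose definition is not reproduced in this section.

```python
def select_row(test_values, matrix_rows, column_weights):
    """Return (index, distance) of the row closest to the test values.

    The distance is the sum of the weights of the columns whose value
    differs from the test value (a weighted analog of Hamming distance).
    On ties, the first row with the minimum distance wins.
    """
    best_index, best_distance = None, None
    for index, row in enumerate(matrix_rows):
        distance = sum(
            weight
            for value, stored, weight in zip(test_values, row, column_weights)
            if value != stored
        )
        if best_distance is None or distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance

# Toy evaluation matrix: three rows, three program outputs per row.
rows = [(1, 0, 1), (1, 1, 1), (0, 1, 1)]
weights = (0.5, 0.25, 0.25)  # hypothetical stand-ins for SGF(M,Nmi)
index, distance = select_row((1, 1, 0), rows, weights)
```

Using a strict less-than comparison is what makes the first of several equally distant rows the one selected.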


3.2  Protein  Design 

Protein design generally refers to de novo approaches for new proteins. The process starts with
first principles and attempts to design model proteins from scratch. Because the design process
starts from scratch, little homology with native sequences is expected in the de novo approach.
The approach critically tests the designer’s understanding of protein structure.

A number of recent experiments have explored variants along the de novo theme, including
[DeGrado, et al., 1989], [Regan and DeGrado, 1988], [Wendoloski and Salemme, 1992], [Fedorov,
et al., 1992], [Hecht, et al., 1990].

For example, the design strategy of [Hecht, et al., 1990] is to use natural structural motifs to
design sequences that are native-like in pattern and composition, are locally non-repetitive, and
are not homologous to any known protein. This is a kind of “design by analogy.” The protein
created by this strategy is called “Felix,” a four-helix bundle protein. We present a sketch of that
protein later, together with a discussion of an approach to the re-creation of the secondary
structure of Felix using our TSC system for case-based design, also referred to as design by
analogy. We now discuss that effort.

3.2.1  Case-Based  Protein  Design 

We have selected a case-based (cf. [Riesbeck and Schank, 1989]) approach to protein design.
This approach suggests the ability to “learn” directly from the database supplied by nature. The
alternative is de novo design approaches, which require a knowledge base strong in protein
folding first principles. Coupling of the case-based approach with the discovery of de novo
design rules is suggested as an important extension of this work. We apply our nearest neighbor
prediction algorithm to analyze the designed proteins. Our approach to design, analysis, and
evaluation of proteins is diagrammed in figure 3-1.

Figure 3-1: General TSC Case-Based Protein Design.


3.2.2  Approach 

In the big picture, our approach is:

Given a library of proteins and a design declaration:

Search for all candidate structures similar to structures in the desired design
Find the combinations of structures with a “best fit”

Publish the results

The  TSC  case-based  code  is  a  program  that  will  design  proteins  of  a  given  secondary  structure, 
using  a  model  from  case-based  reasoning.  An  overview  of  this  process  is  depicted  in  figure  3-2. 
By  starting  with  a  database  of  proteins  (a  case  library)  whose  structure  is  known,  the  system 
finds,  by  indexing  and  analogy,  appropriate  sequences  of  amino  acids  needed  to  produce  desired 
structures.  The  design  process  then  becomes  a  process  of  merely  “putting  together  the  pieces.” 

Figure  3-2:  Architecture  for  protein  design  and  evaluation 


Although  the  TSC  system  uses  a  moderately  extensive  database  of  amino  acids  and  their  properties, 
this  database  is  used  only  to  refine  the  design  process.  The  system  currently  uses  a  simple 
template  matching  process  in  order  to  do  its  first-pass  design,  and  then  uses  the  amino  acid 
database  to  select  options  from  that  first  design.  To  accomplish  this  task,  the  system  needs  to  be 
able  to  derive  answers  to  the  following  questions: 

1. What structures have I seen next to each other in a protein?

2.  What  was  the  size  of  those  structures? 

3.  What  amino  acids  were  involved  in  creating  those  structures? 

4.  What  are  the  structures,  neighbors  and  sizes  involved  in  the  protein  to  be  designed? 

Furthermore,  the  system  should  be  able  to  accommodate  proteins  that  have  multiple  subunits.  A 
protein  is  said  to  have  multiple  subunits  when  it  contains  two  or  more  disconnected  sequences. 
The  system  needs  to  be  able  to  access  all  of  these  sequences,  but  also  to  know  that  they  are 
physically  disconnected.  To  accomplish  these  goals,  the  system  searches  protein  database  frames 
(database  entries),  each  of  which  holds  several  important  components.  Each  protein  frame  contains 
(at  least)  the  following  pieces  of  information:  an  ordered  list  of  the  amino  acids  that  compose  it, 
and,  for  each  possible  structure,  an  indication  of  where  (if  at  all)  the  structure  occurs  in  this 
particular  protein. 

To  more  accurately  illustrate  this,  we  present  an  example  of  a  protein  database  frame  here, 
adapted  from  the  Brookhaven  PDB.  Note  the  list  of  amino  acids,  and  the  indications  of  where 
helices,  sheets,  and  turns  are  found. 
C: PDB3B5C

name CYTOCHROME.B5

instance.of protein

functionality electron.transport

tertiary.structure small.ss.rich.or.metal.rich-metal.rich-up.down.ligand.cages

source bovine.liver

MY.DATA ( SER LYS ALA VAL LYS TYR TYR THR LEU GLU GLU ILE GLN

LYS HIS ASN ASN SER LYS SER THR TRP LEU ILE LEU HIS TYR LYS VAL TYR ASP LEU

THR LYS PHE LEU GLU GLU HIS PRO GLY GLY GLU GLU VAL LEU ARG GLU GLN ALA GLY

GLY ASP ALA THR GLU ASN PHE GLU ASP VAL GLY HIS SER THR ASP ALA ARG GLU LEU

SER LYS THR PHE ILE ILE GLY GLU LEU HIS PRO ASP ASP ARG SER LYS ILE THR LYS

PRO SER GLU SER )

HELIX.POSITION ( ( 9 12 ) ( 33 38 ) ( 44 47 ) ( 53 60 ) ( 65 71 ) ( 81 86 ) )

SHEET.POSITION ( ( 5 7 ) ( 21 25 ) ( 27 32 ) ( 51 54 ) ( 74 80 ) )

TURN.POSITION ( ( 17 21 ) ( 2* 28 ) ( 39 42 ) ( 49 52 ) )
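
As a rough illustration of how such a frame might be used, the Python sketch below stores a fragment of the frame as a dictionary and pulls out the residues of one structure. The dictionary layout, the function `extract_segment`, and the optional `flank` argument are assumptions for illustration, as is the reading of positions as 1-based and inclusive; only the first thirteen residues of the listing are included to keep the example short.

```python
def extract_segment(residues, start, end, flank=0):
    """Return residues start..end (1-based, inclusive), optionally with
    `flank` extra residues on either side, clipped to the sequence ends."""
    low = max(start - flank, 1)
    high = min(end + flank, len(residues))
    return residues[low - 1:high]

# Fragment of the PDB3B5C frame above (first 13 residues, first helix only).
frame = {
    "name": "CYTOCHROME.B5",
    "MY.DATA": "SER LYS ALA VAL LYS TYR TYR THR LEU GLU GLU ILE GLN".split(),
    "HELIX.POSITION": [(9, 12)],
}

start, end = frame["HELIX.POSITION"][0]
helix = extract_segment(frame["MY.DATA"], start, end)
helix_with_neighbors = extract_segment(frame["MY.DATA"], start, end, flank=1)
```

Under these assumptions the first helix is LEU GLU GLU ILE, and with one flanking residue on each side it becomes THR LEU GLU GLU ILE GLN.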

In  addition,  a  target  protein  is  needed;  in  order  to  create  a  new  design,  TSC  needs  to  have  a 
design  specification.  The  user  creates  an  experiment  frame  which  contains  the  description  of  the 
protein  along  with  several  parameters  used  by  the  system.  The  following  is  an  example  of  an 
experiment  frame  used  in  this  design  exercise. 


C: EXPERIMENT.42

INSTANCE.OF EXPERIMENT

WORTH see

CONTEXT PROTEIN

DATA.SOURCE PDB.DATA

DATA.FILES ( PDB1HDS PDB2LTH PDB3HHB PDB2DHB PDB1FDH PDB1LDB
PDB1TGS PDB2CCY PDB2CA2 PDB2CAB PDB3SGB PDB1SGT PDB1PPO
PDB1GCR PDB1MBD PDB1MBS PDB2CDV PDB1CY3 PDB1LZT PDB1FX1
PDB1PYP PDB1FC2
PDB1HNE PDB2STV
PDB1CC5 )

ACTOR.SOURCE NATURAL.AMINO.ACID

TEST.ATTRIBUTES ( POLARITY MOLECULAR.WEIGHT SIZE.OF.SIDE.CHAIN
SIDE.CHAIN.MUTABILITY HYDROPATHY )

TEST.WEIGHTS ( .25 .10 .20 .10 .35 )

OVERLAP 1

SIZE 65

2.STRUCT ( S ( 4 15 ) ( 41 49 ) ) ( H ( 16 28 ) ( 32 48 ) ) ( T ( 49 54 ) )


3.2.3  Case-Based  Design  Algorithm 

With all this data in hand, the target protein can be designed. The structure of the case-based
design algorithm is as follows:

1. Initialize the system by loading the protein database, and analyzing it to produce the information
used by the design algorithms.

2. Analyze the target protein to produce the data structures necessary for design.

3. For each structure in the target, do the following:

a. Search for a known structure(s) that has the same structures both before and after,
and differs in length by fewer than 3 amino acids. If several can be found that are
equally close in length, store all of them, giving preference to longer structures.

b. If none can be found with the same neighboring structures, try to find a known
structure that has similar structures both before and after, and that differs in length by
fewer than 3 amino acids. Again, keep all that are equally close, with preference to
longer ones.

c. If none can be found with similar neighbors, try to find a known structure of closest
length with any neighboring structures. Keep all that are equally close, with preference
to longer ones.

d. Get the data for each structure from its native protein, including one amino acid on
either side. If there were several possibilities, get the data for all of them.

4. At this point, we have a list of possible choices for each structure in the protein. For each possible
neighboring pair of structures, compare the overlapping amino acids, and determine their
difference score, as explained in section 2.1.4. Store this data.

5. Find the combination of possible structures that has the lowest total difference, compile the list of
structures into a protein, and fill in the slots of the new protein frame appropriately.

For the new protein, it may be true that the system could not find an exact match for a given
structure, and resorted to a structure of slightly different length. In the resulting protein frame,
the system notes the position of all structures as they are placed by the design algorithm. These
positions may not be the same as those in the target specification.

In placing structures, the system compares the similarity of the ends of the structure with the
ends of its neighbors. For instance, if the system is trying to place a helix between two random
coils, it will remember not only the amino acids that compose the helix, but also the first amino
acid out of the helix on each end. Then, the last residue in the helix and the first residue in the
coil from each grabbed helix are compared with the last residue in the helix and the first residue
in the coil from each grabbed coil. This comparison is repeated for every junction between
structures, and the overall configuration with the lowest total difference is selected for the final
design.

For example, consider that we want to design a protein containing a helix that connects to a turn,
and we have helices with the following endings:

... GLY TRP ALA, ... VAL CYS VAL, ... ALA LEU VAL, ... LEU VAL THR, ... ARG THR GLY

The last amino acid in each group above is actually the first amino acid of the turn. We also have
one turn, with the following beginning structure: ALA ILE THR..., where the alanine is the last
amino acid of the helix. The system will compare the overlapping region for each possible
combination (TRP-ALA and ALA-ILE, CYS-ALA and VAL-ILE, LEU-ALA and VAL-ILE, etc.), and
note the difference score for each choice. It will then look at every possible combination of
structures to make up the entire protein, and then choose the structure that has the fewest
differences between neighbors. Consider the following illustration of this configuration:

[Illustration of the helix-turn overlap; not recoverable from the source]

The use of only one extra residue at each end of a structure is arbitrary, and might be more
effective were there to be two or more compared. An interesting direction for future work might
be to compare the effectiveness of using only one extra residue with using more than one.
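
The junction comparison and the search over combinations can be sketched in Python. This is a brute-force illustration under assumptions, not the TSC implementation: `Candidate`, `junction_cost`, and `best_combination` are hypothetical names, the 0/1 `mismatch` function is a placeholder for the attribute-based difference score, and the second helix candidate is invented so that the toy run has a clear winner.

```python
from itertools import product
from typing import NamedTuple, Optional, Tuple

class Candidate(NamedTuple):
    name: str
    head: Optional[Tuple[str, str]]  # (residue before structure, its first residue)
    tail: Optional[Tuple[str, str]]  # (its last residue, residue after structure)

def mismatch(a, b):
    """Placeholder residue distance: 0 if identical, 1 otherwise."""
    return 0.0 if a == b else 1.0

def junction_cost(left, right, distance=mismatch):
    """Compare the overlap stored with the left candidate against the
    overlap stored with the right candidate, residue by residue."""
    return sum(distance(a, b) for a, b in zip(left.tail, right.head))

def best_combination(choices, distance=mismatch):
    """Try every combination (one candidate per structure, in order) and
    return the one with the lowest total junction cost."""
    best, best_cost = None, float("inf")
    for combo in product(*choices):
        cost = sum(junction_cost(l, r, distance) for l, r in zip(combo, combo[1:]))
        if cost < best_cost:
            best, best_cost = combo, cost
    return best, best_cost

helices = [
    Candidate("helix.1", None, ("TRP", "ALA")),  # ending ... GLY TRP ALA
    Candidate("helix.x", None, ("ALA", "VAL")),  # invented candidate
]
turns = [Candidate("turn.1", ("ALA", "ILE"), None)]  # beginning ALA ILE THR ...
combo, cost = best_combination([helices, turns])
```

Exhaustive enumeration grows multiplicatively with the number of candidates per structure, so a real system with many candidates would likely need pruning or dynamic programming over adjacent pairs.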

To determine the difference score of adjacent structures, the following algorithm is applied:

1. Determine which amino acid attributes (molecular weight, polarity, etc.) will be used to determine
the difference, and gather their possible values. Each attribute will have a set of discrete possible
values. (The attributes are supplied by the user, but the algorithm gathers the values.)

2. Determine the weight of each attribute as given by the user.

3. For each amino acid in the overlapping region, for each attribute under test, calculate the
difference value as shown below.

4. Sum the difference for all amino acids in the overlapping region to determine the difference score.

For example, the possible values for molecular weight are: very light, light, medium, heavy, and
very heavy. We interpret these as evenly spaced numerical values (e.g., 1 through 5) and define
the individual attribute difference as:


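\[ \mathrm{Dist}_{attr} = \frac{\lvert \mathrm{Val}_i - \mathrm{Val}_j \rvert}{\mathrm{Max} - \mathrm{Min}} \]

We then define the amino acid difference value as:

\[ \mathrm{Dist}_{total} = \sum_{attributes} \mathrm{Dist}_{attr} \times \mathrm{Weight}_{attr} \]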
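
The two definitions can be written out as a short Python sketch. The function names and the toy attribute values below are hypothetical; in particular, the polarity levels and the two example residues are invented and do not describe real amino acids. The weights .10 and .25 are used only as sample numbers.

```python
def attribute_distance(value_i, value_j, levels):
    """|Val_i - Val_j| / (Max - Min), with the discrete levels interpreted
    as evenly spaced numbers 1..len(levels)."""
    if len(levels) < 2:
        return 0.0  # a single-level attribute cannot distinguish residues
    i = levels.index(value_i) + 1
    j = levels.index(value_j) + 1
    return abs(i - j) / (len(levels) - 1)

def residue_difference(residue_i, residue_j, levels_by_attribute, weights):
    """Weighted sum of the attribute distances for one pair of residues."""
    return sum(
        weights[attr] * attribute_distance(
            residue_i[attr], residue_j[attr], levels_by_attribute[attr])
        for attr in weights
    )

levels_by_attribute = {
    "molecular.weight": ["very light", "light", "medium", "heavy", "very heavy"],
    "polarity": ["low", "medium", "high"],  # hypothetical levels
}
weights = {"molecular.weight": 0.10, "polarity": 0.25}
residue_a = {"molecular.weight": "light", "polarity": "low"}    # invented
residue_b = {"molecular.weight": "medium", "polarity": "high"}  # invented
difference = residue_difference(residue_a, residue_b, levels_by_attribute, weights)
```

The difference score for a junction is then the sum of `residue_difference` over the residue pairs in the overlapping region.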


The particular attributes and weights used in our initial evaluation (which can be seen in the
description of the EXPERIMENT.42 experiment shown above) were chosen “seat-of-the-pants,”
and have no particular theoretical justification. It would be an interesting future project to model
the effects of different choices of these parameters on the accuracy of the resulting design.

Consider this example of the frame of a designed protein. This example is derived, by design,
from the EXPERIMENT.42 task listed above.


C: CON_1

INSTANCE.OF PROTEIN

SOURCE SBT.CREATE.PRO

SHEET.POSITION ( ( 5 16 ) ( 41 48 ) )

HELIX.POSITION ( ( 17 29 ) ( 32 40 ) )

TURN.POSITION ( ( 49 54 ) )

MY.DATA ( ILE PRO GLU TYR ARG GLY SER THR THR GLY THR HIS SER
GLY SER VAL GLY PHE VAL GLY ALA SER TYR VAL PHE ALA LEU MET ASN ASP PHE LEU

PHE PRO PRO LYS PRO LYS ASP THR LEU LYS ALA ASN VAL PRO PHE VAL ASP TRP ARG

GLN LYS GLY PRO PRO ALA SER PRO LYS ALA ASP ALA PRO ILE )


The structures found in CON_1 can be compared to the specifications in EXPERIMENT.42 (the
contents of 2.STRUCT). Note that the structures are not in exactly the positions intended by the
designer, but they are close. TSC operates under the assumption that random coils are the least
important structures; because of that, they can be shrunk or expanded to make up for errors
introduced in the positioning of previous structures. This is why the second helix and sheet are in
the proper position, even though the first helix and sheet were not.


3.3 TEM Control

3.3.1  TEM  Control  Approach 

The  general  architecture  of  our  approach  is  illustrated  in  figure  3-3.  In  this  approach,  TSC,  with 
its  discovery  behaviors  and  a  new  knowledge  base  created  for  design  and  control  of  TEM 
experiments,  is  coupled  to  both  a  scope,  and  to  a  scope  simulator.  Work  to  date  has  coupled  TSC 
only  to  a  scope  simulator.  Future  work  may  complete  the  coupling  to  a  live  scope. 


3.3.2  The  TSC  Combined  Analysis/Controller 

The generalized approach to our TEM controller involves analysis of crystal patterns yielded by
either a scope or a simulator. Figure 3-4 illustrates that our approach will combine both analytical
algorithms and case-based studies of crystals. We believe that this will enable the TSC system to
identify crystal structures more rapidly and accurately by considering cases from its experience,
and relying on industry-standard analytical techniques when cases fail to explain a detected
crystal structure.

Figure 3-5 is a flowchart of the analytical approach taken by the crystal analysis routines written
for TSC.

Figure 3-4: Approaches to studying crystals.

Figure 3-5: Flowchart of analysis

Chapter 4


Results 


4.1  Tools 

4.1.1  Discovery  Tools 

Source code for the nearest neighbor pattern recognition and rough set evaluation packages is
included in an appendix to this report. Source code for the genetic rule builder is also included.

4.1.2  Design 

Our  directed  evolution  work  with  TSC  inspired  a  design  activity  as  a  means  to  develop  and 
extend  the  capabilities  of  the  directed  evolution  approach.  It  makes  sense  that  design  be  explored 
since  all  of  materials  science  and  engineering  involves  design.  We  shall  return  to  materials 
design  in  a  discussion  of  protein  results  below;  for  now,  we  illustrate  a  different  exercise  in 
design  using  the  TSC  directed  evolution  approach.  This  project  enabled  the  development  of  the 
directed  evolution  approach  to  design. 

The task was to design a very fast sailboat [Park, 1993], one with an unconventional configuration
which would maintain contact with the water, and use the wind to reach speeds greater than 60 miles
per hour. To do so, we applied directed evolution to the task of evolving a design, given an initial
design, and we applied directed evolution to the task of evolving the design rules themselves.
This was illustrated earlier.

The approach was to use the envisionment building tools to evolve the design; an environment
of possible designs grew out of the exercise. Periodically, a mutation rule fires and mutates a
design rule. A number of design episodes are created, some applying the new rule. All rules are
evaluated according to their contribution to the design, and the best design is studied. The graph
below illustrates one of the designs explored by TSC. The plot shows the relationship between
net forward thrust and boat speed for a 20 mph wind speed. The concave downward curve
satisfies our intuitions that the faster the boat travels, the less surplus thrust it will have to
accelerate. Ideally, one reads the maximum speed as the point where the curve crosses the x-axis.


The curve offers a pair of interesting points worth pondering: two discontinuities are noted. The
upper left, occurring at low speed, is easily explained by the boat lifting a balance ski out of the
water when it is no longer needed to maintain, as sailors would say, an even keel. The second
discontinuity, occurring at much higher speed, is a bit more interesting. In fact, we found this
second point an inspiration for discovery: design rules to keep the boat in the water. The
problem of the boat lifting clear of the water remains partially unsolved at this writing;
the effective maximum speed of the boat is therefore the speed at which it tries to fly. The
problem suggests a counter-intuitive solution: the heavier the boat, the faster it can go.


4.2  Applications 

4.2.1  Proteins 

4.2.1.1 Prediction by GA Rule Building

The problem is to predict the presence of helical structures in a protein when given its amino
acid sequence. The DE approach is to build a population of rules when given a “genome.” The
genome is built from three “chromosomes,” the working components of an observer rule. A pair
of observer rules (“englishized” for readability) follows:

RULE: OBS.1

IF you have the actors: ALA and GLU
AND the relation: ALA abuts GLU
THEN predict a helix

RULE: OBS.2

IF you have the actors: LEU and GLU
AND the relation: LEU abuts GLU
THEN predict a helix

The three chromosomes are: actors, relations, and predictions. Actors are drawn from
the twenty natural amino acids (e.g., Alanine, Arginine, Asparagine, etc.).

Relations are primarily structural or spatial in this example:
abuts-(LEU is followed by ALA)

precedes.1-(LEU is preceded by ALA with one amino acid between them)
precedes.2-(LEU is preceded by ALA with two amino acids between them)

The types of predictions available are based on the structure to be predicted. Helix, Sheet, and
Turn are the typical protein structures to be predicted.

A TSC experiment begins with the observer rules being exercised on segments of a protein
database. These segments are called windows of data, i.e., a sequence of amino acids. During
learning, the window is shifted along the sequence from start to finish. An example window of
data, with a window size of five (5) amino acids, looks like:

... PHE GLN THR [ ala arg gly ala pro ] HIS ILE VAL...

Shifting of the window involves moving the window to the right by one amino acid:

... PHE GLN THR ALA [ arg gly ala pro his ] ILE VAL...
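
The window shifting shown above amounts to a simple sliding-window generator. A minimal Python sketch, with the hypothetical name `slide_windows`, applied to the residues visible in the example:

```python
def slide_windows(sequence, size=5):
    """Yield each window of `size` consecutive residues, moving the
    window to the right by one residue at a time."""
    for start in range(len(sequence) - size + 1):
        yield sequence[start:start + size]

sequence = "PHE GLN THR ALA ARG GLY ALA PRO HIS ILE VAL".split()
windows = list(slide_windows(sequence, size=5))
```

Here `windows[3]` is the first bracketed window in the example and `windows[4]` is the window after one shift.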

All rules are exercised (allowed to vote) on each window. Voting is repeated as the window is
“slid” along the entire data set. A reward/punishment algorithm then examines the performance
of the individual rules which cast a vote. Following the “bucket brigade” algorithm of John
Holland [Holland, 1986], those rules which participate in the vote, and which vote correctly, get
a reward (their worth is increased).

Once rewards have been passed to appropriate rules, a small decay (reduction of worth) of all
rules is computed. This has the effect of punishing those rules which do not participate in the
vote, or vote incorrectly. All rules are used for breeding until their worth falls below a specified
value, at which time they are eliminated from the gene pool.

Directed evolution exercises a genetic algorithm on the rule population to conduct a search for
the optimum rule set. For example, using as parents OBS.1 and OBS.2 listed above, crossover
builds one child or constructed rule that looks like the following:

RULE: CON.3

IF you have the actors LEU and ALA
AND the relation LEU abuts ALA
THEN predict a helix

This “child” rule is added to the population of rules and given a starting worth value. Using
CON.3, point mutation may build CON.4:

RULE: CON.4

IF you have the actors LEU and MET
AND the relation LEU abuts MET
THEN predict a helix

We have developed a test knowledge base comprised of 70 observation rules (random combinations
of actors and relations) and exercised the DE on this knowledge base. Training was conducted
with ten proteins from the Brookhaven Protein Database. Results are illustrated in figure 4-1 but
represent only the initial performance of the DE system, and provide some early indication of the
make-up of rules which address the objective of the project, i.e., the discovery of rules which
successfully predict helices in proteins with fair accuracy.

Figure 4-1: System Performance (trial KB), 11 Sep 92.

Legend:

error of omission: % of helices with no prediction
(100% - % errors = value of line #1)

error of commission: % of incorrect predictions
(100% - % errors = value of line #2)

Figure 4-2 shows two curves, the percentage of helices discovered by the DE (line #1) and the
accuracy with which the rules fired (line #2). The DE attempts to find most of the helices before
it attempts to improve the accuracy of the rules. The graph also indicates a few missed opportunities
during the experiment. For example, upon finding fifty percent of the helices, the DE began
refining the rules through “viral” mutation. A lost opportunity was caused by ending the experiment
after five hundred cycles; predictive accuracy was still increasing at a useful rate.

Interestingly, the votes may be expanded to include positive and negative predictions, e.g.,
predict helix and predict no helix. Empirically, because of the large number of “negative”
training examples (there are typically 5 to 10 times as many non-helix as helix windows in a
protein file), false rules begin to dominate and performance deteriorates. As a result, we have
learned that FALSE predictions are detrimental to the DE system’s performance, and thus only
TRUE predictions are pursued. Prediction, in this research, is restricted to the helix structure.

The following graph illustrates the generally upward slope of accuracy in our experimental work
with the GA approach to building protein structure prediction rules. As work wound down on
this project, the slope continued to point upward. Our experience indicates that this is a rather
computationally intensive approach.


4.2.1.2 Prediction by GA Program Building

Testing of the genetic programming approach, coupled to rough set evaluation, was conducted
during this Phase 2 exercise. As mentioned before, testing was limited to small database predictions.
Our testing confirmed the ability of this approach to recall stored patterns and to predict from
unseen patterns. We achieved 100% accuracy on recall and 96% accuracy for prediction. This is
consistent with the 93% to 100% accuracy reported by Salzberg [1990]. Our results for the iris
database are illustrated below. This approach turns out to be sufficiently compute-intensive that
the protein application was not explored in this project.

Figure  4-3  shows  the  k(P,N)  values  obtained  during  training. 


Figure 4-3: k(P,N) during training.

Figure 4-4 shows results of testing for pattern recall during training. This data was obtained by
testing on a subset of the data used for training.

[Figure 4-4: plot of recall accuracy during training; image not reproduced]


Figure  4-5  shows  results  of  testing  for  predictive  ability  during  training.  This  data  was  obtained 
by  testing  on  a  subset  of  the  data  disjoint  from  that  used  for  training. 


Figure 4-5: Prediction accuracy during training

Figure 4-6 shows changes in the size of the minimal set, M, during training.


Figure 4-6: Number of programs in minimal set during training

Our test results suggest there may be two distinct aspects to learning in this approach. The first is
the development of a population of programs sufficient to recall members of the training set. We
see this in Figures 4-3 and 4-4, where k(P,N) and the recall accuracy proceed to the maximum
value of 1.0. At this point we might expect learning to stop. However, in a number of trials we
found that prediction accuracy continued to improve, albeit sometimes erratically.

Looking at Figure 4-5 we see that prediction accuracy continued to increase for some time after
k(P,N) reached its maximum. We suggest an explanation may be found in the size of the minimal
set shown in Figure 4-6. At about the same time as prediction accuracy began its rise to a final
maximum of 96%, the size of the minimal set began to decrease from a high of 10 programs to a
range of 6-8 programs. This is consistent with other machine-learning techniques, in which
smaller representations tend to have a greater ability to generalize.

The biggest drawback to this approach is the computational cost. Our implementation performed
well on the database of iris flowers, but the computational burden was a significant problem in
preliminary tests on the much larger task of discovering regularities between the primary and
secondary structure of proteins.


4.2.1.3  Prediction  by  Nearest  Neighbor  and  Protein  Design 


In a typical experiment a protein was designed by the TSC case-based design code. The resulting
design frame from one run, known as CON_3, is:

C: CON_3
name            TESTPR03
instance.of     protein
functionality   none
source          design.pro
HELIX.POSITION  ( ( 10 21 ) ( 51 56 ) ( 59 66 ) ( 71 80 ) )
SHEET.POSITION  ( ( 27 35 ) ( 41 46 ) )
MY.DATA  ( HIS TRP GLY TYR GLY LYS HIS ASN GLY GLU VAL THR CYS

VAL  VAL  VAL  ASP  VAL  SER  HIS  GLU  PRO  SER  SER  LEU  ASP  CYS  SER  LEU  GLY  PHE  ASN 

VAL  GLY  ASP  SER  LEU  VAL  THR  PHE  THR  VAL  ALA  GLY  GLU  ALA  ASN  SER  CYS  VAL  GLY 

as  HIS  LEU  GLY  ASP  GLY  ASP  ASP  VAL  VAL  ALA  LYS  TYR  GLY  LEU  ASP  GLY  LEU  LYS 

PRO  LEU  ALA  GLN  SER  HIS  ALA  THR  GLY  PHE  HIS  GLY  ) 


In early developmental trials of the nearest neighbor code, a training set of 66 proteins from the
Brookhaven PDB collection was selected, which generated 9,794 exemplars. With a window
length of 17, we ran the nearest neighbor algorithm on the designed protein, and evaluated the
predictions of the nearest neighbor as compared to the design specification. Consider these
results (where overall performance = total # correct predictions / total # predictions available)
from a trial on CON_3 and compare them to the results presented in this table:


[Bar chart, 0% to 100% scale. Legible values: % alpha correctly predicted: 37; overall
performance: 50. A further bar, for beta or coil, reads 67; the remaining value is not legible
in this copy.]

The output of the nearest-neighbor structure prediction algorithm is a list spanning the sequence,
indicating whether each amino acid is part of a helix (A), sheet (B), or coil (C). The trial resulted
in the following predicted structure:

(CCBAABBBBCCCCCCCCBBBBBBBCCCCCCBBBBBB

CCCAAAACCCACCAAAACCCCCCCAAAAAA)


The CON_3 protein was intended to look like the following:

(CAAAAAAAAAAAACCCCCBBBBBBBBBCCCCCBBBBB

BCCCCAAAAAACCAAAAAAAACCCCAAAA)
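
The percentages charted for these trials can be computed by comparing the predicted and intended strings position by position. The sketch below is a minimal Python illustration, not the TSC code. The function name `score_prediction` is hypothetical, and it assumes that each per-class percentage is the share of residues intended for that class that were predicted as that class; the report defines only the overall figure explicitly.

```python
def score_prediction(predicted: str, intended: str) -> dict:
    """Return per-class and overall percentages of correct predictions."""
    if len(predicted) != len(intended):
        raise ValueError("predicted and intended structures differ in length")
    totals = {"A": 0, "B": 0, "C": 0}    # residues intended for each class
    correct = {"A": 0, "B": 0, "C": 0}   # of those, how many were predicted correctly
    for p, t in zip(predicted, intended):
        totals[t] += 1
        if p == t:
            correct[t] += 1
    result = {}
    for cls in "ABC":
        # A class absent from the design has no defined percentage.
        result[cls] = round(100 * correct[cls] / totals[cls]) if totals[cls] else None
    result["overall"] = round(100 * sum(correct.values()) / len(intended))
    return result


# Toy example (not data from the report): 8 positions, 6 predicted correctly.
print(score_prediction("AACACCBC", "AAAACCBB"))
```

Reporting `None` for a class that does not occur in the design is a design choice of this sketch; it keeps a design with no sheets from being scored as 0% on beta.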

The test was then repeated with a training set of 91 proteins (25 additional). The test applied
15,704 exemplars. The test, on CON_3, yielded these results:


[Bar chart, 0% to 100% scale, of % alpha, % beta, and % coil correctly predicted and overall
performance. Three values are legible in this copy: 47, 62, and 48; their assignment to the four
bars is not recoverable.]


This  trial  resulted  in  the  following  predicted  structure: 

(CBABBBBBAAAABCCCCCBBBBBBBACCCCBCCCCCC

CCAAAACCCCCCACAABCCCCBCAAAAAA) 


A later trial involved a “redesign” of Felix, a particular de novo protein documented in
the literature. The creators of Felix describe the protein as a de novo antiparallel four-helix
bundle designed for a specific topology [Hecht, et al., 1990]. Its designers intended to choose an
amino acid sequence unrelated to any native sequence, but which would fold into a desired
three-dimensional structure.


We  have  chosen  to  closely  follow  the  design  specifications  of  Felix,  and  apply  the  TSC  case-based 
design  code.  Note  that  our  case-based  design  does  not  duplicate  the  specific  residue  sequence  of 
Felix,  but  does  duplicate  the  secondary  structure  of  that  protein.  Figure  4-7  illustrates  the  TSC 
clone  of  Felix,  which  has  the  same  shape  as  the  original  Hecht  et  al.  sequence. 

The case-based design program created an amino acid sequence and named the protein CON_4T.
Consider the following design frame developed to duplicate the secondary structure of Felix,
with the application of analogy rather than de novo rules:


C: CON_4T
instance.of     protein
source          bbt.create.pro
HELIX.POSITION  ( ( 1 19 ) ( 22 37 ) ( 40 58 ) ( 63 78 ) )
MY.DATA  ( PRO ILE LYS TYR LEU GLU PHE ILE SER GLU ALA ILE ILE
HIS VAL LEU HIS SER LYS ASP PHE SER ASP GLY GLU TRP HIS LEU VAL LEU ASN VAL
TRP GLY LYS VAL GLU ASP PHE PRO ILE LYS TYR LEU GLU PHE ILE SER GLU ALA ILE
ILE HIS VAL LEU HIS SER ARG LYS HIS LYS ILE TYR PRO GLY GLN ILE THR SER ASN
MET PHE CYS ALA GLY TYR LEU GLU )

The intended structural configuration is illustrated in Figure 4-7, below.

Figure 4-7: Felix, following [Hecht, et al., 1990].

The structure represented by the TSC-designed sequence was intended to look like the following:

(AAAAAAAAAAACCAAAAAAAAAAAAAAAACCAAAAAA

AAAAAAAAAAAAACCCCAAAAAAAA)

The nearest neighbor structure prediction code, trained on 91 proteins from the Brookhaven PDB
collection (none of which are available to the design algorithm), predicted the following structure
from the sequence:

(AtAAAAAAACCCCtCAACAAAAAAACCeCCCAAAkAA 

AAAAAAAAAAACAAACCAAAAACCC) 

The final results of the prediction were:

[Bar chart, 0% to 100% scale. Legible values: % alpha correctly predicted: 76; % beta correctly
predicted: 0. The remaining values are not legible in this copy.]

The  agreement  with  helix  and  turn  (coil)  design  is  interesting;  the  results  are  higher  than  published 
predictions  for  any  protein  based  strictly  on  native  sequences.  There  is  no  agreement  on  beta 
sheet  prediction  since  none  were  included  in  the  design,  and  only  a  single  instance  of  B  showed 
up  in  the  prediction. 

[Zhang et al., 1992] report that the variation in performance of a single algorithm from one test
set to another can be quite large. A fair measure of accuracy of an algorithm is the average of
several different tests. Indeed, the [Salzberg and Cost, 1992] results in the first table reflect a
result averaged over 10 tests. The results reported here are not averaged over a number of tests;
that remains for future work.

Our results illustrated here suggest that accuracy improves slightly when the prediction
training set size is decreased. To examine this behavior, a trial was conducted with the original 66
proteins serving as the training set for the nearest neighbor code. The designed protein is CON_4T,
the TSC Felix clone. The prediction improved, and produced the following predicted structure:

(AAAAAAAAACCCCCCAAAAAAAAACCBABCCAAAAA
A  A  A  A  (  A  A  A  A  A  A  A  A  A  B  A  C  '  A  U  A  A  C  C  C  ) 

The  prediction  made  three  errors  by  including  “B”  beta  sheets,  but  appears  to  have  improved  its 
alpha  helix  prediction.  Consider  the  results: 


[Bar chart, 0% to 100% scale. Legible values: % alpha correctly predicted: 80; % beta correctly
predicted: 0; % coil correctly predicted: 62. The overall performance value is not legible in this
copy.]

Further characterization of this behavior will require a population study of the selected training
proteins. Preliminary indications are that certain training proteins from the additional set of 25
generate exemplars which may be “nearer” to the testing window than are exemplars generated
from the original 66 proteins, but which offer an improper prediction. Factors involved in this
prediction performance include window length and training set homology. These, and other
factors, remain a topic of continued research. An interesting approach to prediction improvement,
as suggested in section 3.2, is to enable the exemplar weighting scheme in the TSC nearest
neighbor code.
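
As an illustration of how window length determines the number of exemplars, the following Python sketch slides a fixed-length window along a sequence and labels each window with the structure class of its central residue. This is a sketch under stated assumptions rather than the TSC code: the function name `make_exemplars`, the tuple layout (window, class, uses, correct uses), the starting counts of 1, and the choice of the central residue as the labeled position are all hypothetical.

```python
def make_exemplars(residues, structure, window=17):
    """Build (window, class, uses, correct) tuples, one per full window."""
    if len(residues) != len(structure):
        raise ValueError("one structure label is needed per residue")
    half = window // 2
    exemplars = []
    for start in range(len(residues) - window + 1):
        win = tuple(residues[start:start + window])
        label = structure[start + half]        # class of the central residue
        exemplars.append((win, label, 1, 1))   # assumed initial use counts
    return exemplars


# A chain of n residues yields n - window + 1 exemplars, e.g. 82 residues -> 66.
demo = make_exemplars(["GLY"] * 82, "C" * 82)
print(len(demo))
```

Because every added training protein contributes roughly its length in exemplars, the exemplar count grows quickly with the training set, which is one reason search cost matters for this method.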


The correlation coefficients Ca, Cb, and Ccoil reported by [Salzberg and Cost, 1992], illustrated
above in Table 1, are computed with an algorithm due to [Mathews, 1975], and may reflect
slightly higher values than those reported here by us. The differences, if any, are the subject of
continued study. Direct comparison of results is problematic since the [Salzberg and Cost, 1992]
results are based on tests conducted on proteins from the Brookhaven database, and our results
are based on tests conducted on proteins designed by analogy to proteins in the Brookhaven
database. In addition, PEBLS applied a weighting factor to the exemplars which was reported to
improve its performance significantly over the unweighted version; our work has not yet applied
the weighting scheme. Finally, PEBLS includes a post-processing step based on the minimum
sequence length restrictions used by [Holley and Karplus, 1989]. This restricts beta sheets to a
minimum contiguous sequence of two residues, and alpha helices to no fewer than four residues.
This is reported to improve PEBLS performance [Cost and Salzberg, 1993].
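
The minimum-length restriction described above can be sketched as a pass over the predicted string that relabels runs that are too short. The Python below is an illustrative sketch only, not the PEBLS or Holley and Karplus procedure. The function name `enforce_min_runs` is hypothetical, and relabeling a short run as coil (C) is an assumption, since the text states only the minimum lengths (two for sheet, four for helix).

```python
from itertools import groupby

MIN_RUN = {"A": 4, "B": 2}   # minimum contiguous lengths stated in the text


def enforce_min_runs(structure: str, min_run=MIN_RUN, fill: str = "C") -> str:
    """Relabel helix/sheet runs shorter than their minimum length as `fill`."""
    out = []
    for label, run in groupby(structure):
        n = len(list(run))
        if n < min_run.get(label, 1):
            out.append(fill * n)      # run too short: treat it as coil (assumption)
        else:
            out.append(label * n)
    return "".join(out)


print(enforce_min_runs("CCAAACBCCBBAAAAC"))
```

The sketch makes a single pass, so it never lengthens a run; a fuller treatment might instead extend short runs or merge neighbors, which the text does not specify.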


4.3  TEM 

Our results to date on the crystallography task have been limited by budget considerations in the
other tasks. We have built the TEM simulator with our consultant A.G. Jackson, and have
integrated it with a knowledge base for TSC which operates the simulator as though it were a
live TEM. As a 03 activity, we met with an interested scope maker and exhibited our tool.
Discussions on further development in a cooperative venture with this manufacturer were limited
because they do not have a large enough market to justify R&D in this domain. They
did, however, offer us access to their entire clientele by way of their research newsletter. Dr. Jackson
has authored a paper for that journal.

We  now  sketch  the  demonstration,  as  it  has  been  conducted. 


4.3.1  Launching  the  TEM  Controller/Simulator 

The following illustrates the display when the TEM simulator is loaded.


4.3.3 Initialize Simulator

With the simulator and TSC in memory, the simulator is then initialized by TSC. This initialization
calibrates the simulator to enable further analysis of zones and crystal structures. Initialization
involves TSC calculating the zone on each of two different crystal images generated by the
simulator. The first image follows.

This is the first image requested by TSC.

[Screen image not reproduced: Find Pattern 1]

Next, the second image requested by TSC.

With both images analyzed, the TSC TEM controller is now ready to analyze the crystal further.


4.3.4  User  Dialog  With  the  TEM  Simulator 

Following initialization, the user is free to use dialogs and the command-line interface to request
different images from the simulator. On a live TEM, the same dialogs and command-line references
would cause the scope to produce images. A typical command-line request is:

go to zone [1 -4 1]

This request would cause TSC to respond with an image appropriate to that zone. The following
images illustrate using the dialog windows to explore a crystal.

As mentioned earlier, the simulator has been coded and demonstrated, but project requirements
in other tasks, especially the Protein task, prevented completion of the TEM controller activity.


Chapter 5 Summary and Conclusions

The totality of the work performed thus far supports and envisions a unified tool approach to
materials science and engineering. We summarize the architecture of such a system in Figure 5-1.


To  achieve  such  an  architecture,  much  work  remains.  We  list  many  of  the  suggestions  for 
improvement  and  further  work  which  have  emerged  from  our  research.  We  suspect  that  a  tool 
very  useful  to  the  materials  fields  will  emerge  when  this  research  yields  a  successful  03  activity. 

5.1  General  Improvements 

Improvements  to  the  TSC  case-based  design  system  include: 

•  Ability  to  have  user  generated  overlap  parameters. 

As  discussed  above,  the  TSC  case-based  design  system  compares  adjacent  structures  by  selecting 
one  additional  residue  at  each  end,  and  makes  a  judgement  about  the  similarity  of  the  potential 
new  overlap  with  the  overlap  from  the  original  sequence.  This  choice  of  only  one  residue  is 
arbitrary,  and  it  may  prove  interesting  to  vary  the  amount  of  overlap  used  and  see  how  the 
accuracy  of  the  final  design  varies.  While  this  poses  a  few  minor  additional  computational 
problems  (such  as  what  to  do  when  a  chosen  structure  is  so  near  the  end  of  its  native  protein  that 
there  aren’t  enough  extra  residues  to  make  up  the  overlap),  none  of  these  computational  problems 
are beyond the capability of TSC.

•  Testing of different parameter values for amino acid comparison.

The  method  used  to  make  judgements  about  the  similarity  of  neighboring  amino  acids  is  arbitrary. 
It  might  very  well  be  the  case  that  a  different  method  (a  different  set  of  attributes  and  weights,  or 
possibly  an  analysis  of  the  molecular  structure  of  the  residues)  could  produce  much  better 
results. 

•  Ability  to  generate  new  cases  through  adaptation  of  old  cases. 

At present, the system finds all the potential residue fragments, compares them, and connects the
least different to form the output protein. There is no provision for modifying the individual
structures in any way to build a better protein. It may be useful to apply a heuristically guided
method for changing the newly created protein.

•  Validation  of  new  proteins  and  inclusion  in  case  library. 

Validating  proteins  by  X-ray  crystallography  (or  other  method)  and  placing  the  results  back  into 
the  case  libraries  would  create  an  external  feedback  loop  as  described  above.  This,  in  combination 
with  the  following,  may  allow  the  system  to  eventually  discover  new  details  of  the  first  principles 
of  protein  folding. 

•  Checking  a  new  sequence  against  library  for  unwanted  structure  matches. 

The  program  could  check  the  case  library  to  see  if  it  has  incorporated  a  sequence  of  amino  acids 
which  is  identical  to  a  known  example  of  an  unwanted  structure.  While  this  would  not  guarantee 
that  the  protein  would  form  the  desired  structure,  it  would  reduce  the  likelihood  that  the  protein 
would  fold  to  some  other  structure. 

•  Characterize  the  effects  of  larger  case-based  libraries 

The case libraries used in the protein design algorithm are at this point a small subset of the
Brookhaven PDB. It is anticipated that providing more cases will improve the performance,
though there may be a trade-off between accuracy and design time. Further study will be
needed to find the best mix of speed and accuracy.

•  Long-term  improvements 

Beyond each of these immediate improvements to the TSC system, a long-term extension is to
combine case-based design with aspects of de novo design. This would require that the system
discover new protein “first principles” from both its design-prediction cycles and from database
mining.

A  further  exercise  would  organize  the  design  and  the  training  proteins  into  a  taxonomy  as 
suggested  above.  Design  experiments  may  then  be  conducted  by  specifying  the  branch  of  the 
taxonomy  to  be  used  in  both  design  and  prediction;  the  set  selected  for  an  exercise  must  then  be 
partitioned  into  design  and  prediction  training  sets. 

Finally,  it  will  be  useful  to  construct  a  family  of  designed  proteins,  characterize  them,  and  apply 
appropriate  feedback  to  TSC  and  to  build  a  library  of  protein  designs  exhibiting  certain  (e.g. 
electro-optical)  properties.  With  this  feedback,  the  family  of  proteins  designed  may  be  classified 
as  accurate  or  inaccurate,  and  the  details  of  errors  generated  noted  in  the  design  frames.  With  this 
feedback  mechanism,  TSC  would  be  able  to  modify  its  own  memory  (akin  to  traditional  dynamic 
memory  algorithms  [Schank,  1982]). 
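
The library check proposed in the list above (looking for stretches of a new sequence that are identical to a known example of an unwanted structure) can be sketched as a fragment search. The Python below is a hypothetical illustration, not part of TSC: the function name `unwanted_matches`, the dictionary layout of the case library, the structure labels, and the minimum fragment length are all assumptions.

```python
def unwanted_matches(design, library, wanted, min_len=4):
    """List (position, fragment, structure) wherever a library fragment whose
    structure differs from the wanted one occurs verbatim in the design."""
    hits = []
    for fragment, structure in library.items():
        n = len(fragment)
        if structure == wanted or n < min_len:
            continue
        for i in range(len(design) - n + 1):
            if tuple(design[i:i + n]) == fragment:
                hits.append((i + 1, fragment, structure))   # 1-based position
    return hits


# Toy case library (not data from the report): fragment -> observed structure.
library = {
    ("VAL", "VAL", "ILE", "VAL"): "SHEET",
    ("GLU", "ALA", "LEU", "LYS"): "HELIX",
}
design = ["PRO", "VAL", "VAL", "ILE", "VAL", "GLU", "ALA", "LEU", "LYS"]
print(unwanted_matches(design, library, wanted="HELIX"))
```

As the text notes, an empty result would not guarantee the desired fold; it only removes exact matches to known counterexamples.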

5.2 Prediction System Improvements

Improvements to the TSC nearest neighbor prediction system include:

•  Characterize the effect of increasing the prediction database

Currently,  only  a  small  subset  of  the  Brookhaven  PDB  is  used  as  a  basis  for  the  prediction 
algorithm.  Preliminary  results  have  shown  that  increases  in  the  database  size  may  improve  the 
accuracy  of  the  nearest  neighbor  technique  in  characterizing  proteins  the  system  has  designed.  If 
some  implementational  speedups  are  introduced  to  the  algorithms,  it  may  be  practical  to  introduce 
larger  protein  databases  than  are  currently  used. 

•  Training  exemplar  weights  to  improve  the  prediction  performance. 

PEBLS [Cost and Salzberg, 1993] applies a weighting scheme to predictions offered by exemplars.
The weights require a second training pass through the training proteins to adjust the weights. As
mentioned in section 3.1, [Cost and Salzberg, 1993] report the weighting improves prediction
performance of PEBLS.

When PEBLS selects exemplars for prediction, the distance of the exemplar window from the
testing window is calculated, and that distance is multiplied by the weight value of the particular
exemplar. Smaller weight values imply smaller distance values; the lowest corrected distance
determines the “winning” exemplar. Weights represent a kind of statistical property of the
exemplars. A lower weight in a given exemplar (relative to the rest of the population) implies that
the exemplar is more reliable at forming valuable predictions. The current implementation
defaults to a weight equal to 1.0 for every exemplar.

The  TSC  nearest  neighbor  code  presently  allows  for  weight  training,  but  the  weighting  algorithm 
is  not  enabled  for  the  experiments  reported  here.  It  will  be  useful  and  interesting  to  enable  the 
code  and  measure  changes  to  prediction  performance. 
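
The selection rule described above (distance multiplied by exemplar weight, lowest product wins) can be sketched as follows. This is a minimal Python illustration, not the TSC or PEBLS implementation: the function names are hypothetical, the distance is a simple count of mismatched positions standing in for whatever metric the real code uses, and the weight is taken as uses divided by correct uses.

```python
def mismatch_distance(window, other):
    """Count positions at which two equal-length windows differ (assumed metric)."""
    return sum(1 for a, b in zip(window, other) if a != b)


def weight(exemplar):
    """Weight = uses / correct uses; 1.0 when every use was correct."""
    _, _, uses, correct = exemplar
    return uses / float(correct)


def predict(test_window, exemplars):
    """Return the class of the exemplar with the lowest weighted distance."""
    best = min(exemplars,
               key=lambda e: mismatch_distance(test_window, e[0]) * weight(e))
    return best[1]


# Toy exemplars (not data from the report): (window, class, uses, correct uses).
exemplars = [
    (("GLY", "ALA", "LEU"), "A", 10, 5),    # weight 2.0: right half the time
    (("GLY", "ALA", "VAL"), "B", 10, 10),   # weight 1.0: always right
]
# Both exemplars are one mismatch from the test window; the lower weight wins.
print(predict(("GLY", "ALA", "SER"), exemplars))
```

With every weight left at its default of 1.0, the rule reduces to plain nearest-neighbor selection, which matches the unweighted configuration used for the experiments reported here.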

Overall, it is fair to comment that this 02 project evolved over time to emphasize the directed
evolution aspect of protein evaluation, largely to the detriment of the other aspects of this
research originally envisioned. However, the work continues. We see a reasonable 03 extension
into the pedagogical applications of our TEM simulator. We further see a reasonable 03 extension
of our nearest neighbor, rough set, and qualitative modeling tools.


GENERALIZED NEAR.NEIGHB.T

to do: consider implementing a "tree" of exemplars, perhaps anchored in the AA database such that
you take each AA in a window and walk the tree. If the tree no longer supports your next AA,
then you must compare your window to all the REST of the tree exemplars.
This should cut down on some of the searching needed to compare exemplars to windows.

dates 

05/20/93 jp2: first cut, made from NEARNEIGHB.T to generalize
away from proteins
05/21/93 jp2: added weights, other minor changes
05/24/93 jp2: minor fixes to get weights working properly
06/01/92 jp2: added correctness totalizers


\ ---- GLOBALS

( *flag *exemplar *window *classification *winzad    \ These should not be global
  *table.list *exemplar.list *train# *test# *train.list *test.list *self.exemp
  *ct.a *ct.b *ct.c *cor.a *cor.b *cor.c *mix )

&global.list union 10-anchor &global.list

\ - MISC  SUPPORT 

(=====
LISTREPLACE
description: Given a list, a number, and a sx, return the list with that position element
replaced by the given sx
example input: ( A B C D E ) 3 X
example output: ( A B X D E )
notes: X
)

c: LISTREPLACE
instance.of    list.func
\ my.creator   bbt
i.take         list number sx
i.give         list
arguments      *my.list *position *new.element
my.vars        *pos *new.list *max.pos
algorithm      ( do
                 ( cond
                   ( ( equal? *position 1 )
                     ( bindq *new.list ( cons *new.element ( rest *my.list ) ) ) )
                   ( T
                     ( bindq *new.list ( concat ( concat ( grabfirstn *my.list ( sub1 *position ) )
                                                         ( list *new.element ) )
                                                ( clip.list *my.list ( add1 *position ) ) ) ) ) )
                 ( return *new.list ) )


c:  SUBUNIT  OF 

sub  of  information^lci 

c: HAS.SUBUNITS
instance.of    flow.func
my.creator     wjb
i.take         none
i.give         list
my.vars        *protein.list *pro.name *list *data
algorithm      ( do
                 ( bindq *protein.list ( get 'PROTEIN 'SUBS ) )
                 ( loop.until ( null? *protein.list )    \ for each protein
                   ( do ( bindq *pro.name ( first *protein.list ) )
                        ( bindq *data ( reverse ( get *pro.name 'MY.DATA ) ) )
                        ( if.true ( greater.than? ( length *data ) 1 )
                          ( bindq *list ( cons *pro.name *list ) ) )
                        ( bindq *protein.list ( rest *protein.list ) ) ) )
                 ( return *list ) )

(=====
SET.USE
description: install an exemplar use value in an exemplar - the third value
)

c: SET.USE
instance.of    flow.func
my.creator     jp2
i.take         list number
i.give         list
arguments      *exemp *use
my.vars        *cor
algorithm      ( listreplace *exemp 3 *use )
comment:
algorithm      ( do
                 ( bindq *exemp ( reverse *exemp ) )
                 ( bindq *cor ( first *exemp ) )
                 ( bindq *exemp ( rest ( rest *exemp ) ) )
                 ( bindq *exemp ( cons *use *exemp ) )
                 ( bindq *exemp ( reverse ( cons *cor *exemp ) ) )
                 ( return *exemp ) )
comment;

c: GET.USE
instance.of    flow.func
my.creator     jp2
i.take         list
i.give         number
arguments      *exemp
algorithm      ( third *exemp )
comment:
algorithm      ( second ( reverse *exemp ) )
comment;

(=====
INCUSE
description: increment an exemplar use value in an exemplar - the third value
)

c: INCUSE
instance.of    flow.func
my.creator     jp2
i.take         list
i.give         list
arguments      *exemp
my.vars        *use
algorithm      ( listreplace *exemp 3 ( add1 ( third *exemp ) ) )
comment:
algorithm      ( do
                 ( bindq *exemp ( reverse *exemp ) )
                 ( bindq *use ( second *exemp ) )
                 ( bindq *use ( add1 *use ) )
                 ( return ( set.use ( reverse *exemp ) *use ) ) )
comment;

(=====
SET.CORRECT
description: install an exemplar correct use value in an exemplar - the fourth value
)

c: SET.CORRECT
instance.of    flow.func
my.creator     jp2
i.take         list number
i.give         list
arguments      *exemp *cor
algorithm      ( listreplace *exemp 4 *cor )
comment:
algorithm      ( do
                 ( bindq *exemp ( reverse *exemp ) )
                 ( bindq *exemp ( rest *exemp ) )
                 ( bindq *exemp ( reverse ( cons *cor *exemp ) ) )
                 ( return *exemp ) )
comment;


c: GET.CORRECT
instance.of    flow.func
my.creator     jp2
i.take         list
i.give         number
arguments      *exemp
algorithm      ( fourth *exemp )
comment:
algorithm      ( first ( reverse *exemp ) )
comment;


(=====
INC.CORRECT
description: increment an exemplar correct use value in an exemplar - the fourth value
)

c: INC.CORRECT
instance.of    flow.func
my.creator     jp2
i.take         list
i.give         list
arguments      *exemp
my.vars        *cor
algorithm      ( listreplace *exemp 4 ( add1 ( fourth *exemp ) ) )
comment:
algorithm      ( do
                 ( bindq *exemp ( reverse *exemp ) )
                 ( bindq *cor ( first *exemp ) )
                 ( bindq *cor ( add1 *cor ) )
                 ( return ( set.correct ( reverse *exemp ) *cor ) ) )
comment;


(=====
GET.WEIGHT
description: returns weight from an exemplar
weight = #uses / #correct uses
smaller = better
)

c: GET.WEIGHT
instance.of    flow.func
my.creator     jp2
i.take         list
i.give         number
arguments      *exemp
algorithm      ( quotient ( third *exemp ) ( float ( fourth *exemp ) ) )


(=====
SORT.ON.WEIGHT
description: sort exemplar list on ascending weights
)

COMMENT:
c: SORT.ON.WEIGHT
instance.of    flow.func
my.creator     jp2
i.take         list
i.give         list
arguments      *list
COMMENT:

c: INCACTUAL
instance.of    flow.func
my.creator     jp2
i.take         symbol
i.give         none
arguments      *act
algorithm      ( cond
                 ( ( same? *act 'A )
                   ( setq *ct.a ( add1 *ct.a ) ) )
                 ( ( same? *act 'B )
                   ( setq *ct.b ( add1 *ct.b ) ) )
                 ( T ( setq *ct.c ( add1 *ct.c ) ) ) )

c: INCPREDICTED
instance.of    flow.func
my.creator     jp2
i.take         symbol
i.give         none
arguments      *pred    \ only when correct
algorithm      ( cond
                 ( ( same? *pred 'A )
                   ( setq *cor.a ( add1 *cor.a ) ) )
                 ( ( same? *pred 'B )
                   ( setq *cor.b ( add1 *cor.b ) ) )
                 ( T ( setq *cor.c ( add1 *cor.c ) ) ) )

\ ---- LIST SUPPORT

(=====
DELETE-ELEMENT
description: Deletes the element at position *pos of list *list, returning the result;
numbering begins with 1.
)

c: DELETE-ELEMENT
instance.of    flow.func
my.creator     wjb
i.take         list number
i.give         list
arguments      *list *pos
my.vars        *temp *count
algorithm      ( do
                 ( if.true ( equal? *pos 0 )
                   ( display "DELETE-ELEMENT: Cannot take 0 as position argument" 'error )
                 )
                 ( bindq *count 1 )
                 ( bindq *temp nil )
                 ( loop.until ( equal? *count *pos )
                   ( do ( bindq *temp ( cons ( first *list ) *temp ) )
                        ( bindq *list ( rest *list ) )
                        ( bindq *count ( add1 *count ) ) ) )
                 ( bindq *temp ( reverse *temp ) )
                 ( return ( concat *temp ( rest *list ) ) ) )


(=====
INSERT-ELEMENT
description: Inserts *element at position *pos of list *list, returning the result;
numbering begins with 1.
)

c: INSERT-ELEMENT
instance.of    flow.func
my.creator     wjb
i.take         list number sx
i.give         list
arguments      *list *pos *element
my.vars        *temp *count
algorithm      ( do
                 ( if.true ( equal? *pos 0 )
                   ( display "INSERT-ELEMENT: Cannot take 0 as position argument" 'error )
                 )
                 ( bindq *count 1 )
                 ( bindq *temp nil )
                 ( loop.until ( equal? *count *pos )
                   ( do ( bindq *temp ( cons ( first *list ) *temp ) )
                        ( bindq *list ( rest *list ) )
                        ( bindq *count ( add1 *count ) ) ) )
                 ( bindq *temp ( cons *element *temp ) )
                 ( bindq *temp ( reverse *temp ) )
                 ( return ( concat *temp *list ) ) )


(=====
REPLACE-ELEMENT
description: Replaces the element at position *pos of list *list with *element,
returning the result; numbering begins with 1.
)

c: REPLACE-ELEMENT
instance.of    flow.func
my.creator     wjb
i.take         list number sx
i.give         list
arguments      *list *pos *element
my.vars        *temp *count
algorithm      ( do
                 ( if.true ( equal? *pos 0 )
                   ( display "REPLACE-ELEMENT: Cannot take 0 as position argument" 'error )
                 )
                 ( bindq *count 1 )
                 ( bindq *temp nil )
                 ( loop.until ( equal? *count *pos )
                   ( do ( bindq *temp ( cons ( first *list ) *temp ) )
                        ( bindq *list ( rest *list ) )
                        ( bindq *count ( add1 *count ) ) ) )
                 ( bindq *temp ( cons *element *temp ) )
                 ( bindq *temp ( reverse *temp ) )
                 ( return ( concat *temp ( rest *list ) ) ) )


-  ALGORITHM  SUPPORT 


COUNT.CLASSES 


description:  takes  the  list  of  exemplars 

typical  exemplar  looks  lie:  ((glyglu  pro  _)B  1 1) 
returns  a  list  e.g.  (2S69  1091  576  1202.38.20.42) 
which  is:  Total  (alpha  » beta  (coil  italpha  it  beta  %coU 

MUST  BE  MODIFIED  TO  INC  CLASSES  OTHER  THAN  A,  B.  and  C  AA&&&A& 

) 


c: COUNT.CLASSES
instance.of  flow.func
my.creator   wjb
i.take       list   \ exemplars
i.give       list
arguments    *list
my.vars      *a.count *b.count *c.count *length *exemp
algorithm    (do
  ( bindq *a.count 0 )
  ( bindq *b.count 0 )
  ( bindq *c.count 0 )
  ( bindq *length ( length *list ) )
  ( loop.until ( null? *list )
    ( do ( bindq *exemp ( first *list ) )   \ get an exemplar
         ( if.true ( list? *exemp )
           ( bindq *exemp ( second *exemp ) ) )   \ get classification
         ( cond ( ( same? *exemp 'A )
                  ( bindq *a.count ( add1 *a.count ) ) )
                ( ( same? *exemp 'B )
                  ( bindq *b.count ( add1 *b.count ) ) )
                ( T ( bindq *c.count ( add1 *c.count ) ) ) )
         ( bindq *list ( rest *list ) ) ) )
  ( return ( list *length *a.count *b.count *c.count
                  ( quotient *a.count ( float *length ) )
                  ( quotient *b.count ( float *length ) )
                  ( quotient *c.count ( float *length ) ) ) ) )
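The counting above can be sketched in Python as follows. This is a hedged illustration, not the report's implementation: the tuple layout `(window, classification, #use, #correct)` follows the description, while the function name and the empty-list guard are assumptions.

```python
def count_classes(exemplars):
    """Return [total, #A, #B, #C, fraction A, fraction B, fraction C].

    Each item is either an exemplar (window, classification, use, correct)
    or a bare classification symbol; anything that is not 'A' or 'B'
    is counted as 'C', as in the listing.
    """
    counts = {"A": 0, "B": 0, "C": 0}
    for item in exemplars:
        label = item[1] if isinstance(item, (list, tuple)) else item
        counts[label if label in ("A", "B") else "C"] += 1
    total = len(exemplars)
    # Assumption: an empty list yields zero fractions instead of dividing by 0.
    fractions = [counts[c] / total if total else 0.0 for c in "ABC"]
    return [total, counts["A"], counts["B"], counts["C"], *fractions]
```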


(=====
FOCUS.IN.CLASS?
description: Tests to see if the AA at position *pos (focus) in the protein given by *pro.name
lies in class (A, B, or C) *struct. Struct values are ALPHA and BETA.

This is the TRAINING feedback routine.
)


c: FOCUS.IN.CLASS?
instance.of  flow.pred
i.take       symbol symbol integer
i.give       flag
arguments    *struct *pro.name *pos
my.vars      *truth *position *first
algorithm    (do
  ( bindq *truth F )
  ( cond ( ( same? *struct 'ALPHA )
           \ typical helix position slot val: ( ( 33 38 ) ( 53 60 ) ( 65 71 ) ( 81 86 ) )
           ( bindq *position ( value.of *pro.name 'HELIX.POSITION ) ) )
         ( ( same? *struct 'BETA )
           \ typical sheet position slot val: ( ( 5 7 ) ( 21 25 ) ( 27 32 ) ( 51 54 ) ( 74 80 ) )
           ( bindq *position ( value.of *pro.name 'SHEET.POSITION ) ) )
         ( T ( bindq *position nil ) ) )   \ NOT alpha, NOT beta
  ( loop.until ( or? *truth ( null? *position ) )   \ for each helix/sheet position pair
    ( do ( bindq *first ( first *position ) )
         ( bindq *truth ( and? ( not? ( less.than? *pos ( first *first ) ) )
                               ( not? ( greater.than? *pos ( second *first ) ) ) ) )
         ( bindq *position ( rest *position ) ) ) )
  ( return *truth ) )
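The range test can be illustrated with a short Python sketch. The dictionary `ranges_by_struct` is a hypothetical stand-in for the HELIX.POSITION and SHEET.POSITION slots of a protein concept; the sheet ranges are the ones quoted in the listing's comment.

```python
def focus_in_class(struct, ranges_by_struct, pos):
    """True if pos falls inside any inclusive (start, end) pair for struct."""
    ranges = ranges_by_struct.get(struct, [])  # unknown struct: no ranges
    return any(start <= pos <= end for start, end in ranges)


# Hypothetical protein record using the sheet ranges from the comment above.
example = {"BETA": [(5, 7), (21, 25), (27, 32), (51, 54), (74, 80)]}
```

A position that lies in neither a helix nor a sheet range is treated as coil by the callers.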


(=====
PARTITION.DATA
description: Assigns training and testing data from *protein.list to globals *train.list and
*test.list according to the values of *train# and *test#.

Just takes a list of proteins and cuts it into two parts
)
c: PARTITION.DATA
instance.of  flow.func
my.creator   wjb
i.take       list
i.give       none
arguments    *protein.list
my.vars      *length *train.clip *test.clip *temp
algorithm    (do
  ( bindq *length ( length *protein.list ) )
  ( if.true ( greater.than? ( plus *train# *test# ) *length )
    ( display> "Attempted to partition more data than exists!" error ) )
  ( bindq *train.clip ( minus *length *train# ) )
  ( bindq *test.clip ( minus ( minus *length *test# ) *train# ) )
  ( bindq *temp ( reverse *protein.list ) )
  ( setq *train.list ( reverse ( clip.list *temp *train.clip ) ) )
  ( bindq *temp ( reverse *temp ) )
  ( bindq *temp ( clip.list *temp *train# ) )
  ( bindq *temp ( reverse *temp ) )
  ( setq *test.list ( reverse ( clip.list *temp *test.clip ) ) ) )
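Assuming `clip.list` drops the first n elements of a list (its definition is not shown in this part of the listing), the reverse-and-clip sequence above amounts to taking the first *train# proteins for training and the next *test# for testing. A minimal Python sketch under that assumption:

```python
def partition_data(proteins, n_train, n_test):
    """Split proteins into (training, testing) lists by position."""
    if n_train + n_test > len(proteins):
        raise ValueError("Attempted to partition more data than exists!")
    train = proteins[:n_train]
    test = proteins[n_train:n_train + n_test]
    return train, test
```

The sketch returns both lists instead of setting globals; any proteins beyond the first `n_train + n_test` are left unused.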

(=====
SHIFT.DIST
description: Subtracts *distance from each of the entries of each pair in *pairs.
)

c: SHIFT.DIST
instance.of  flow.func
my.creator   wjb
i.take       list number
i.give       list
arguments    *pairs *distance
my.vars      *in *out *shifted-prs
algorithm    (do
  ( bindq *shifted-prs nil )
  ( loop.until ( null? *pairs )
    ( do ( bindq *out nil )
         ( bindq *in ( first *pairs ) )
         ( bindq *out ( cons ( minus ( second *in ) *distance ) *out ) )
         ( bindq *out ( cons ( minus ( first *in ) *distance ) *out ) )
         ( bindq *shifted-prs ( cons *out *shifted-prs ) )
         ( bindq *pairs ( rest *pairs ) ) ) )
  ( return ( reverse *shifted-prs ) ) )

(=====
INTERNAL-PAIRS
description: find pairs from a given *pairs.list which are within c-term & n-term
)


c: INTERNAL-PAIRS
instance.of  flow.func
my.creator   wjb
i.take       number number list
i.give       list
arguments    *lftend *rtend *pairs.list
my.vars      *pair *intern.pairs
algorithm    (do



  ( bindq *intern.pairs nil )
  ( loop.until ( null? *pairs.list )
    ( do ( bindq *pair ( first *pairs.list ) )
         ( if.true ( and? ( not? ( less.than? ( first *pair ) *lftend ) )
                          ( not? ( greater.than? ( second *pair ) *rtend ) ) )
           ( bindq *intern.pairs ( cons *pair *intern.pairs ) ) )
         ( bindq *pairs.list ( rest *pairs.list ) ) ) )
  ( return ( reverse *intern.pairs ) ) )


(=====
CREATE.SUBUNIT
description: make a new concept as a subunit of a given concept
e.g. hemoglobin has 4 subunits
)


c: CREATE.SUBUNIT
instance.of  flow.func
my.creator   wjb
i.take       symbol list list number list
i.give       symbol
arguments    *pro.name *helices *sheets *prev.subs.length *sub.data
my.vars      *con *c.term *n.term *sub.helices *sub.sheets
algorithm    (do
  ( bindq *con ( new jmm ) )
  ( bindq *c.term ( add1 *prev.subs.length ) )
  ( bindq *n.term ( plus *prev.subs.length ( length *sub.data ) ) )
  ( bindq *sub.helices ( internal-pairs *c.term *n.term *helices ) )
  ( bindq *sub.helices ( shift.dist *sub.helices *prev.subs.length ) )
  ( bindq *sub.sheets ( internal-pairs *c.term *n.term *sheets ) )
  ( bindq *sub.sheets ( shift.dist *sub.sheets *prev.subs.length ) )
  ( set.value *con 'SUBUNIT.OF *pro.name )
  ( set.value *con 'MY.DATA *sub.data )
  ( set.value *con 'HELIX.POSITION *sub.helices )
  ( set.value *con 'SHEET.POSITION *sub.sheets )
  ( return *con ) )
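The range bookkeeping performed for each subunit (select the helix or sheet pairs that fall inside the subunit, then renumber them relative to the subunit start) can be sketched in Python. Names are illustrative assumptions; concept creation and slot storage are omitted.

```python
def internal_pairs(left_end, right_end, pairs):
    """Keep the (start, end) pairs lying entirely within [left_end, right_end]."""
    return [(a, b) for a, b in pairs if a >= left_end and b <= right_end]


def shift_dist(pairs, distance):
    """Subtract distance from both entries of every pair."""
    return [(a - distance, b - distance) for a, b in pairs]


def subunit_ranges(pairs, prev_subs_length, sub_length):
    """Ranges of one subunit, renumbered so the subunit starts at position 1."""
    start = prev_subs_length + 1
    end = prev_subs_length + sub_length
    return shift_dist(internal_pairs(start, end, pairs), prev_subs_length)
```

Note that a pair straddling a subunit boundary is dropped by this selection, as it is in the listing.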


(=====
ORGANIZE.PROTEINS
description: examines each protein to determine
whether it has subunits.

If it does, concepts are created for each subunit from
sequence, helix and sheet data and
inserted in place of the protein name in the
protein list.

The protein list is then partitioned into
testing and training data.

NEEDS TO BE REPLACED WITH A GENERAL THINGY
)


c: ORGANIZE.PROTEINS
instance.of  flow.func
my.creator   jp2
i.take       list
i.give       list
arguments    *protein.list
my.vars      *pos *temp.pro.list *pro.name *data *#subunits *helices *sheets
my.vars      *prev.subs.length *sub.data *con
algorithm    (do ( display> "Organizing Proteins" print )
  ( display> *protein.list print )
  ( bindq *pos 1 )   \ keeps track of position of current protein
  ( bindq *temp.pro.list *protein.list )
  ( loop.until ( null? *temp.pro.list )   \ for each protein
    ( do ( bindq *pro.name ( first *temp.pro.list ) )
         ( bindq *data ( reverse ( get *pro.name 'MY.DATA ) ) )
         ( bindq *#subunits ( length *data ) )
         \ this assumes subunits are in nested lists from 'MY.DATA
         ( if.true ( greater.than? *#subunits 1 )   \ if more than one subunit
           ( do ( bindq *protein.list ( delete.element *protein.list *pos ) )
                ( bindq *helices ( value.of *pro.name 'HELIX.POSITION ) )
                ( bindq *sheets ( value.of *pro.name 'SHEET.POSITION ) )
                ( bindq *prev.subs.length 0 )
                ( loop.until ( null? *data )   \ for each subunit of protein data
                  ( do ( bindq *sub.data ( first *data ) )   \ get new subunit
                       ( bindq *con ( create.subunit *pro.name *helices *sheets
                                                     *prev.subs.length *sub.data ) )
                       \ is this smart enough not to count same protein again?
                       ( bindq *protein.list ( insert-element *protein.list *pos *con ) )
                       ( bindq *prev.subs.length ( plus ( length *sub.data ) *prev.subs.length ) )
                       ( bindq *data ( rest *data ) ) ) )
                ( bindq *pos ( add1 *pos ) ) ) )
         ( bindq *pos ( add1 *pos ) )
         ( bindq *temp.pro.list ( rest *temp.pro.list ) ) ) )
  ( return *protein.list ) )


(=====
GET-PROTEINS
description: Gets the list of proteins,
RANDOMIZES them and examines each to determine
whether it has subunits.

If it does, concepts are created for each subunit from
sequence, helix and sheet data and
inserted in place of the protein name in the
protein list.

The protein list is then partitioned into
testing and training data.

NEEDS TO BE REPLACED WITH A GENERAL THINGY
)

c: GET-PROTEINS
instance.of  flow.func
my.creator   jp2
i.take       none
i.give       list
my.vars      *protein.list
algorithm    (do
  ( bindq *protein.list ( get 'PROTEIN 'SUBS ) )
  ( bindq *protein.list ( randomize.list *protein.list ) )
  ( bindq *protein.list ( organize.proteins *protein.list ) )
  ( partition.data *protein.list )
  ( return *train.list ) )


(=====
INIT.AA-SLOTS
description: Init FEATURE.COUNT, ALPHA.COUNT, BETA.COUNT & COIL.COUNT
of each AA to (0 0 ... 0) where the length of this list is given by *win.length.
)


c: INIT.AA-SLOTS
instance.of  flow.func
my.creator   wjb
i.take       number
i.give       none
arguments    *win.length
my.vars      *count *initial *AA.list *AA
algorithm    (do
  ( bindq *count 1 )
  ( bindq *initial nil )
  ( bindq *AA.list ( get 'NATURALAMINO.ACID 'SUBS ) )
  ( loop.until ( greater.than? *count *win.length )
    ( do ( bindq *initial ( cons 0 *initial ) )
         ( bindq *count ( add1 *count ) ) ) )
  ( loop.until ( null? *AA.list )
    ( do ( bindq *AA ( first *AA.list ) )
         ( set.value *AA 'FEATURE.COUNT *initial )
         ( set.value *AA 'ALPHA.COUNT *initial )
         ( set.value *AA 'BETA.COUNT *initial )
         ( set.value *AA 'COIL.COUNT *initial )
         ( bindq *AA.list ( rest *AA.list ) ) ) ) )


(=====
COUNT.AA'S
description: Initializes the FEATURE.COUNT, ALPHA.COUNT, BETA.COUNT & COIL.COUNT
slots of each AA to (0 0 ... 0)
where the length of this list is given by *win.length.

Then for each AA in each of the proteins used for training
examine the window determined by this AA and *win.rad:

1. Increment the ith entry of the FEATURE.COUNT slot of the AA in window position i.

2. If the window is centered on an AA which is part of a particular class,
increment the ith entry of the corresponding slot of the AA in window position
i and set the classification flag. (A, B, C for proteins)

3. Create an exemplar from the window, the current value of classification and the
exemplar weight ( initially set to 1 ) and place it on *exemplar.list
e.g.: ( window classification #use #correct )
      ((gly glu pro ...) B 1 1 )
)


c: COUNT.AA'S
instance.of  flow.func
my.creator   jp2 wjb
i.take       number
i.give       none
arguments    *win.rad
my.vars      *protein.list *pro.name *pro.array
             *pro.length *win.length *focus *i
algorithm    ( do ( display> "Counting AAs" print )
  ( setq *exemplar.list nil )
  ( cond ( *mix
           ( bindq *protein.list ( get-proteins ) ) )
         ( T ( do ( bindq *protein.list ( get 'PROTEIN 'SUBS ) )
                  ( bindq *protein.list ( organize.proteins *protein.list ) )
                  ( setq *train.list *protein.list ) ) ) )
  ( bindq *win.length ( add1 ( times *win.rad 2 ) ) )
  ( init.AA-slots *win.length )
  ( loop.until ( null? *protein.list )
    ( do ( bindq *pro.name ( first *protein.list ) )
         ( display> "*pro.name:" print ) ( display *pro.name print )
         ( bindq *pro.array ( create.protein.array *pro.name ) )
         ( bindq *pro.length ( array1@ *pro.array 0 ) )
         ( bindq *focus 1 )
         ( loop.until ( greater.than? *focus *pro.length )   \ For each focus in protein ...
           ( do ( setq *exemplar nil )
                ( setq *window nil )
                ( bindq *i 1 )
                ( setq *flag F )
                ( loop.until ( greater.than? *i *win.length )   \ For each element of window
                  ( do ( bindq *disp ( sub1 ( minus *i *win.rad ) ) )
                       ( bindq *pos ( plus *focus *disp ) )
                       ( if.true ( and? ( greater.than? *pos 0 )
                                        ( not? ( greater.than? *pos *pro.length ) ) )
                         ( do ( bindq *AA ( array1@ *pro.array *pos ) )
                              ( if.true ( and? ( equal? *i 1 )
                                               ( and? ( greater.than? *focus *win.rad )
                                                      ( less.than? *focus ( add1 ( minus *pro.length *win.rad ) ) ) ) )
                                ( setq *flag T ) )
                              ( if.true *flag
                                ( setq *window ( cons *AA *window ) ) )
                              ( bindq *feature.list ( value.of *AA 'FEATURE.COUNT ) )
                              ( if.true ( null? *feature.list )
                                ( do ( display> *i print )
                                     ( display *aa print )
                                     ( display *feature.list print ) ) )
                              ( bindq *temp ( add1 ( nth *feature.list ( sub1 *i ) ) ) )
                              \ ( display *temp print )
                              ( bindq *feature.list ( replace-element *feature.list *i *temp ) )
                              ( set.value *AA 'FEATURE.COUNT *feature.list )
                              \ ( display> " Count.AA 4" print )
                              ( cond ( ( focus.in.class? 'ALPHA *pro.name *focus )
                                       ( do ( bindq *list ( value.of *AA 'ALPHA.COUNT ) )
                                            ( bindq *temp ( add1 ( nth *list ( sub1 *i ) ) ) )
                                            ( bindq *list ( replace-element *list *i *temp ) )
                                            ( set.value *AA 'ALPHA.COUNT *list )
                                            ( setq *classification 'A ) ) )
                                     ( ( focus.in.class? 'BETA *pro.name *focus )
                                       ( do ( bindq *list ( value.of *AA 'BETA.COUNT ) )
                                            ( bindq *temp ( add1 ( nth *list ( sub1 *i ) ) ) )
                                            ( bindq *list ( replace-element *list *i *temp ) )
                                            ( set.value *AA 'BETA.COUNT *list )
                                            ( setq *classification 'B ) ) )
                                     ( T ( do ( bindq *list ( value.of *AA 'COIL.COUNT ) )
                                              ( bindq *temp ( add1 ( nth *list ( sub1 *i ) ) ) )
                                              ( bindq *list ( replace-element *list *i *temp ) )
                                              ( set.value *AA 'COIL.COUNT *list )
                                              ( setq *classification 'C ) ) ) )
                         ) )
                       ( bindq *i ( add1 *i ) )
                  ) )
                ( if.true *flag
                  ( do ( setq *exemplar ( cons 1 *exemplar ) )
                       ( setq *exemplar ( cons 1 *exemplar ) )
                       ( setq *exemplar ( cons *classification *exemplar ) )   \ ( classification #use #correct )
                       ( setq *window ( reverse *window ) )
                       ( setq *exemplar ( cons *window *exemplar ) )
                       \ ( window classification #use #correct )
                       ( setq *exemplar.list ( cons *exemplar *exemplar.list ) ) ) )
                ( bindq *focus ( add1 *focus ) )
           ) )
         ( bindq *protein.list ( rest *protein.list ) ) ) )
  \ could sort exemplar list on weight ...
  ( setq *exemplar.list ( reverse *exemplar.list ) )
  ( display> "********** AA slots set **********" print )
  ( display> "******** Exemplars created ********" print )
  ( display> ( first *exemplar.list ) print )
  ( display> "FIRST EXEMPLARS CREATED:" log )
  ( display> ( first *exemplar.list ) log )
)
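The three steps in the COUNT.AA'S description can be sketched for a single protein in Python. This is a simplified illustration under stated assumptions: the sequence and its per-residue class labels are passed in as plain lists (the listing reads them from concept slots), indices are 0-based, and all names are hypothetical.

```python
from collections import defaultdict


def count_windows(sequence, labels, win_rad):
    """Accumulate per-position counts and build exemplars for one protein.

    sequence: list of residue names; labels: 'A', 'B' or 'C' per residue.
    Returns (feature_counts, class_counts, exemplars).
    """
    win_len = 2 * win_rad + 1
    feature = defaultdict(lambda: [0] * win_len)
    by_class = {c: defaultdict(lambda: [0] * win_len) for c in "ABC"}
    exemplars = []
    n = len(sequence)
    for focus in range(n):
        cls = labels[focus]  # class of the residue the window is centred on
        for i in range(win_len):
            pos = focus + i - win_rad
            if 0 <= pos < n:  # positions outside the protein are skipped
                aa = sequence[pos]
                feature[aa][i] += 1
                by_class[cls][aa][i] += 1
        # Only complete windows become exemplars: (window, class, #use, #correct).
        if win_rad <= focus < n - win_rad:
            window = tuple(sequence[focus - win_rad:focus + win_rad + 1])
            exemplars.append((window, cls, 1, 1))
    return feature, by_class, exemplars
```

As in the listing, windows near the ends of the sequence still contribute to the counts, but they do not produce exemplars.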

(=====
MAKE.TABLE
description: Given a list of ratios build the table ( 2D array ) of the SW-VDM values of this list
)

c: MAKE.TABLE
instance.of  flow.func
my.creator   wjb
i.take       list list
i.give       symbol
arguments    *ratio.list *label.list
my.vars      *dim *table *i *j *trio.i *trio.j *ratio.i *ratio.j *delta *temp *next
algorithm    (do
  ( bindq *dim ( add1 ( length *ratio.list ) ) )
  ( bindq *table ( create.array2 *dim *dim ) )
  ( bindq *j 1 )
  ( bindq *temp *label.list )
  ( loop.until ( equal? *j *dim )   \ fill row 0 with labels (AA's)
    ( do ( bindq *next ( first *temp ) )
         ( array2! *table 0 *j *next )
         ( bindq *temp ( rest *temp ) )
         ( bindq *j ( add1 *j ) ) ) )
  ( bindq *i 1 )
  ( bindq *temp *label.list )
  ( loop.until ( equal? *i *dim )   \ fill column 0 with labels (AA's)
    ( do ( bindq *next ( first *temp ) )
         ( array2! *table *i 0 *next )
         ( bindq *temp ( rest *temp ) )
         ( bindq *i ( add1 *i ) ) ) )
  ( bindq *i 1 )
  ( loop.until ( equal? *i *dim )
    ( do ( bindq *j 1 )
         ( loop.until ( equal? *j *dim )
           ( do ( bindq *trio.i ( nth *ratio.list ( sub1 *i ) ) )
                ( bindq *trio.j ( nth *ratio.list ( sub1 *j ) ) )
                ( bindq *delta 0 )
                ( loop.until ( null? *trio.i )
                  ( do ( bindq *ratio.i ( first *trio.i ) )
                       ( bindq *ratio.j ( first *trio.j ) )
                       ( bindq *delta ( plus ( abs ( minus *ratio.i *ratio.j ) ) *delta ) )
                       ( bindq *trio.i ( rest *trio.i ) )
                       ( bindq *trio.j ( rest *trio.j ) ) ) )
                ( array2! *table *i *j *delta )
                ( bindq *j ( add1 *j ) ) ) )
         ( bindq *i ( add1 *i ) ) ) )
  ( return *table ) )
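The table entry for two amino acids is the sum of absolute differences between their class ratios. A small Python sketch of that computation follows; it stores entries in a dictionary keyed by label pairs rather than in a labelled 2D array, which is an assumption made for brevity.

```python
def make_table(ratios):
    """Build a value-difference table from per-label class ratios.

    ratios: dict mapping a label (amino acid) to its tuple of class ratios.
    Returns a dict mapping (label_i, label_j) to sum(|ratio_i - ratio_j|).
    """
    return {
        (a, b): sum(abs(x - y) for x, y in zip(trio_a, trio_b))
        for a, trio_a in ratios.items()
        for b, trio_b in ratios.items()
    }
```

The resulting table is symmetric and has zeros on the diagonal, so only the ratios, not their order within the trio, affect the distances.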

(=====
CREATE.TABLES
description: For each window position i construct a list of ratio trios.

The trios are obtained by dividing the ith element of each
of the structure count slots by the ith element of
the FEATURE.COUNT slot for each of the 20 AAs.

These lists are then passed to MAKE.TABLE and the resulting
handles are returned in a list
)


c: CREATE.TABLES
instance.of  flow.func
my.creator   wjb
i.take       number
i.give       list
arguments    *win.rad
my.vars      *AA.list *feature *win.length *temp *ratio.list *AA
my.vars      *num *denom *ratio *trio *tables
algorithm    (do
  ( display> "Creating tables" print )
  ( bindq *AA.list ( get 'NATURALAMINO.ACID 'SUBS ) )
  ( bindq *tables nil )
  ( bindq *feature 1 )
  ( bindq *win.length ( add1 ( times *win.rad 2 ) ) )
  ( loop.until ( greater.than? *feature *win.length )   \ for each feature in a window
    ( do ( bindq *temp *AA.list )
         ( bindq *ratio.list nil )
         ( loop.until ( null? *temp )
           \ for each amino acid &&&& generalize
           ( do ( bindq *AA ( first *temp ) )
                ( bindq *trio nil )
                ( bindq *denom ( nth ( value.of *AA 'FEATURE.COUNT ) ( sub1 *feature ) ) )
                \ ( display> "*denom:" debug ) ( display *denom debug )
                ( bindq *num ( nth ( value.of *AA 'ALPHA.COUNT ) ( sub1 *feature ) ) )
                \ ( display> "*num1:" debug ) ( display *num debug )
                ( cond ( ( equal? *denom 0 )
                         ( bindq *ratio 0 ) )
                       ( T ( bindq *ratio ( quotient *num ( float *denom ) ) ) ) )
                ( bindq *trio ( cons *ratio *trio ) )
                ( bindq *num ( nth ( value.of *AA 'BETA.COUNT ) ( sub1 *feature ) ) )
                \ ( display> "*num2:" debug ) ( display *num debug )
                ( cond ( ( equal? *denom 0 )
                         ( bindq *ratio 0 ) )
                       ( T ( bindq *ratio ( quotient *num ( float *denom ) ) ) ) )
                ( bindq *trio ( cons *ratio *trio ) )
                ( bindq *num ( nth ( value.of *AA 'COIL.COUNT ) ( sub1 *feature ) ) )
                \ ( display> "*num3:" debug ) ( display *num debug )
                ( cond ( ( equal? *denom 0 )
                         ( bindq *ratio 0 ) )
                       ( T ( bindq *ratio ( quotient *num ( float *denom ) ) ) ) )
                ( bindq *trio ( cons *ratio *trio ) )
                \ ( display> "*trio:" debug ) ( display *trio debug )
                ( bindq *ratio.list ( cons *trio *ratio.list ) )
                ( bindq *temp ( rest *temp ) ) ) )
         ( bindq *ratio.list ( reverse *ratio.list ) )
         ( bindq *tables ( cons ( make.table *ratio.list *AA.list ) *tables ) )
         ( bindq *feature ( add1 *feature ) ) ) )
  ( display> "Table handles:" print ) ( display ( reverse *tables ) print )
  ( return ( reverse *tables ) ) )
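The ratio computation, including the guard against a zero FEATURE.COUNT entry, can be sketched in Python. The nested-dictionary layout of `counts` is a hypothetical stand-in for the four count slots of each amino-acid concept.

```python
def ratio_trio(feature_count, alpha_count, beta_count, coil_count):
    """Class ratios for one amino acid at one window position."""
    if feature_count == 0:  # avoid dividing by 0, as in the listing
        return (0.0, 0.0, 0.0)
    return tuple(c / feature_count
                 for c in (alpha_count, beta_count, coil_count))


def ratio_lists(counts, win_len):
    """One {amino acid: trio} mapping per window position.

    counts: dict aa -> {'feature': [...], 'alpha': [...], 'beta': [...],
    'coil': [...]}, each list holding win_len integer counts.
    """
    return [
        {aa: ratio_trio(c["feature"][i], c["alpha"][i],
                        c["beta"][i], c["coil"][i])
         for aa, c in counts.items()}
        for i in range(win_len)
    ]
```

Each element of the returned list would then be turned into one distance table, giving one table per window position.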


(=====
GET.WINDOW
description: Returns the *data window centered on *focus with radius *win.rad. An error
is reported if the requested window extends outside the data.
)


c: GET.WINDOW
instance.of  flow.func
my.creator   wjb
i.take       number list number
i.give       list
arguments    *win.rad *data *focus
my.vars      *length *temp
algorithm    (do
  ( bindq *length ( length *data ) )
  ( if.true ( or? ( not? ( greater.than? *focus *win.rad ) )
                  ( greater.than? *focus ( minus *length *win.rad ) ) )
    ( display "GET.WINDOW: Requested window outside data!" error )
  )
  ( bindq *temp ( clip.list *data ( sub1 ( minus *focus *win.rad ) ) ) )
  ( bindq *temp ( reverse *temp ) )
  ( bindq *temp ( clip.list *temp ( minus *length ( plus *focus *win.rad ) ) ) )
  ( return ( reverse *temp ) ) )
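With a 1-based focus, the window extraction reduces to a single slice. A minimal Python sketch (raising an exception where the listing displays an error is an assumption):

```python
def get_window(win_rad, data, focus):
    """Return the 2*win_rad + 1 items of data centred on the 1-based focus."""
    if focus <= win_rad or focus > len(data) - win_rad:
        raise ValueError("GET.WINDOW: Requested window outside data!")
    return data[focus - win_rad - 1:focus + win_rad]
```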


(=====
DELTA.AA
description: Returns the entry at row *AAi and column *AAj of table *k.
entry is Distance between *AAi and *AAj
)

c: DELTA.AA
instance.of  flow.func
my.creator   wjb
i.take       symbol symbol number
i.give       number
arguments    *AAi *AAj *k
my.vars      *AA.list *i *j
algorithm    (do
  ( bindq *AA.list ( get 'NATURALAMINO.ACID 'SUBS ) )
  \ POSITION returns 0 if X not found in list
  ( bindq *i ( position *AAi *AA.list ) )
  ( bindq *j ( position *AAj *AA.list ) )
  ( return ( array2@ ( nth *table.list ( sub1 *k ) ) *i *j ) ) )


(=====
DELTA.WINDOW
description: Returns the distance between the two windows using
Salzberg's method with r = 1 and weight = 1.

When r=1, yields "manhattan" distance
When r=2, yields "euclidian" distance
Salzberg uses 2 typically, but 1 on protein problem
Ripple along window of given length, comparing "features": sum up similarities, then
multiply by weights
)


c: DELTA.WINDOW
instance.of  flow.func
my.creator   wjb
i.take       list list
i.give       number
arguments    *wind1 *exemp
my.vars      *k *sum *AAi *AAj *delta *wind2
algorithm    ( do
  ( bindq *k 1 )
  ( bindq *sum 0 )
  ( bindq *wind2 ( first *exemp ) )   \ get window from exemplar
  ( loop.until ( null? *wind1 )
    ( do ( bindq *AAi ( first *wind1 ) )
         ( bindq *AAj ( first *wind2 ) )
         ( bindq *delta ( delta.AA *AAi *AAj *k ) )   \ distance between 2 feature values
         \ if r=2, you would square *delta here ******
         ( bindq *sum ( plus *sum *delta ) )
         ( bindq *k ( add1 *k ) )
         ( bindq *wind2 ( rest *wind2 ) )
         ( bindq *wind1 ( rest *wind1 ) ) ) )
  \ now that you have sum, multiply by Wx times Wy -- the weights ***
  \ assume Wy is always 1.0 (for now) ******
  \ ( display> " SUM: " print ) ( display *sum print )
  \ ( display " WT: " print ) ( display ( get.weight *exemp ) print )
  ( bindq *sum ( times *sum ( get.weight *exemp ) ) )
  \ ( display *sum print )
  ( return *sum ) )

(=====
GET.PREDICTION
description: Locates the exemplar closest to window and returns it

*delta.best is the distance metric of the closest find - smaller=better
exemplars are stored as a triple:
e.g.: ( window classification #use #correct )
      ((gly glu pro ...) B 1 1 )  - A=helix, B=sheet, C=coil
)

c: GET.PREDICTION
instance.of  flow.func
my.creator   jp2 wjb
i.take       list list
i.give       list
arguments    *window *exemplars
my.vars      *delta.best *exemp *delta
algorithm    (do
  ( bindq *delta.best 1000 )
  ( loop.until ( null? *exemplars )   \ For each exemplar
    ( do ( bindq *exemp ( first *exemplars ) )
         ( bindq *delta ( delta.window *window *exemp ) )
         \ delta is adjusted for exemplar weight, if use.weights=T
         \ NOTE: if *delta = 0, exit loop - you're done
         ( cond ( ( equal? *delta 0 )
                  ( do ( bindq *best *exemp )
                       ( bindq *delta.best *delta )
                       ( bindq *exemplars nil ) ) )   \ kill loop
                ( ( less.than? *delta *delta.best )   \ watch for best
                  ( do ( bindq *best *exemp )
                       ( bindq *delta.best *delta ) ) )
                ( T nil ) )
         ( bindq *exemplars ( rest *exemplars ) ) ) )
  \ ( display> " *BEST:" print ) ( display *best print )
  \ ( display *delta.best print )
  ( return *best ) )
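The weighted window distance and the nearest-exemplar search can be sketched together in Python. This is an illustration under assumptions: `tables` is a list with one distance dictionary per window position (keyed by pairs of amino acids), the exemplar weight is passed in explicitly because `get.weight` is not shown in this part of the listing, and the search starts from infinity rather than the listing's initial value of 1000.

```python
def delta_window(window, exemplar, tables, weight=1.0, r=1):
    """Weighted sum of per-position distances between two windows."""
    other = exemplar[0]  # exemplar = (window, classification, #use, #correct)
    total = sum(tables[k][(a, b)] ** r
                for k, (a, b) in enumerate(zip(window, other)))
    return total * weight


def get_prediction(window, exemplars, tables):
    """Return the exemplar closest to window (smaller distance is better)."""
    best, best_delta = None, float("inf")
    for exemplar in exemplars:
        delta = delta_window(window, exemplar, tables)
        if delta < best_delta:
            best, best_delta = exemplar, delta
            if delta == 0:  # exact match: stop searching
                break
    return best
```

The predicted class is then the second field of the returned exemplar.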

(=====
RECALL.PREDICTION
description: Locates the exemplar closest to window and returns it

*delta.best is the distance metric of the closest find - smaller=better
exemplars are stored as a triple:
e.g.: ( window classification #use #correct )
      ((gly glu pro ...) B 1 1 )  - A=helix, B=sheet, C=coil
)


c: RECALL.PREDICTION
instance.of  flow.func
my.creator   jp2
i.take       list list
i.give       list
arguments    *window *exemplars
my.vars      *delta.best *exemp *delta *best
algorithm    (do ( setq *self.exemp nil )
  ( bindq *delta.best 1000 )
  ( loop.until ( null? *exemplars )   \ For each exemplar
    ( do ( bindq *exemp ( first *exemplars ) )
         ( bindq *delta ( delta.window *window *exemp ) )
         \ delta is adjusted for exemplar weight, if use.weights=T
         ( if.true ( equal? *delta 0 )
           ( setq *self.exemp *exemp ) )   \ remember "self"
         ( if.true ( not? ( equal? *delta 0 ) )   \ skip "self"
           ( do ( if.true ( less.than? *delta *delta.best )   \ watch for next best
                  ( do ( bindq *best *exemp )
                       ( bindq *delta.best *delta ) ) ) ) )
         ( bindq *exemplars ( rest *exemplars ) ) ) )
  \ ( display> " *BEST:" print ) ( display *best print ) ( display *delta.best print )
  ( return *best ) )

(=====
COLLECT.RESULTS
description: Given lists of predicted and actual classes,
determine the percent accuracy of alpha, beta, coil and overall prediction
return them in a list e.g. ( .52 .17 .56 .50 )
)


c: COLLECT.RESULTS
instance.of  flow.func
my.creator   wjb
i.take       list list
i.give       list
arguments    *actual *predicted
my.vars      *first.actual *first.predicted *%correct.A *%correct.B *%correct.C *%correct
my.vars      *total.A *correct.A *total.B *correct.B *total.C *correct.C
algorithm    (do
  ( bindq *total.A 0 ) ( bindq *correct.A 0 )
  ( bindq *total.B 0 ) ( bindq *correct.B 0 )
  ( bindq *total.C 0 ) ( bindq *correct.C 0 )
  ( loop.until ( null? *actual )
    ( do ( bindq *first.actual ( first *actual ) )
         ( bindq *first.predicted ( first *predicted ) )
         ( cond ( ( same? *first.actual 'A )
                  ( do ( bindq *total.A ( add1 *total.A ) )
                       ( if.true ( same? *first.actual *first.predicted )
                         ( bindq *correct.A ( add1 *correct.A ) ) ) ) )
                ( ( same? *first.actual 'B )
                  ( do ( bindq *total.B ( add1 *total.B ) )
                       ( if.true ( same? *first.actual *first.predicted )
                         ( bindq *correct.B ( add1 *correct.B ) ) ) ) )
                ( ( same? *first.actual 'C )
                  ( do ( bindq *total.C ( add1 *total.C ) )
                       ( if.true ( same? *first.actual *first.predicted )
                         ( bindq *correct.C ( add1 *correct.C ) ) ) ) )
                ( T ( display "TEST.CLASSES: Bogus *actual list!" error ) )
         )
         ( bindq *actual ( rest *actual ) )
         ( bindq *predicted ( rest *predicted ) ) ) )
  ( cond ( ( greater.than? *total.A 0 )
           ( bindq *%correct.A ( quotient *correct.A ( float *total.A ) ) ) )
         ( T ( bindq *%correct.A 0 ) ) )   \ avoid dividing by 0
  ( cond ( ( greater.than? *total.B 0 )
           ( bindq *%correct.B ( quotient *correct.B ( float *total.B ) ) ) )
         ( T ( bindq *%correct.B 0 ) ) )   \ avoid dividing by 0
  ( cond ( ( greater.than? *total.C 0 )
           ( bindq *%correct.C ( quotient *correct.C ( float *total.C ) ) ) )
         ( T ( bindq *%correct.C 0 ) ) )   \ avoid dividing by 0
  ( bindq *%correct ( quotient ( plus ( plus *correct.A *correct.B ) *correct.C )
                               ( float ( plus ( plus *total.A *total.B ) *total.C ) ) ) )
  ( return ( list *%correct.A *%correct.B *%correct.C *%correct ) ) )
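The per-class and overall accuracy calculation can be sketched in Python. Names are illustrative; the guard on the overall ratio for empty input is an assumption, since the listing guards only the per-class ratios.

```python
def collect_results(actual, predicted):
    """Return [accuracy A, accuracy B, accuracy C, overall accuracy]."""
    totals = {"A": 0, "B": 0, "C": 0}
    correct = {"A": 0, "B": 0, "C": 0}
    for a, p in zip(actual, predicted):
        if a not in totals:
            raise ValueError("Bogus actual list!")
        totals[a] += 1
        if a == p:
            correct[a] += 1
    # A class with no examples gets accuracy 0 to avoid dividing by 0.
    per_class = [correct[c] / totals[c] if totals[c] else 0 for c in "ABC"]
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0  # assumption: guard added
    return per_class + [overall]
```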

c: COLLECT.FINAL.RESULTS
instance.of  flow.func
my.creator   jp2
i.take       none
i.give       list
my.vars      *%a *%b *%c
algorithm    (do
  ( cond ( ( greater.than? *ct.a 0 )
           ( bindq *%a ( quotient *cor.a ( float *ct.a ) ) ) )
         ( T ( bindq *%a 0 ) ) )   \ avoid dividing by 0
  ( cond ( ( greater.than? *ct.b 0 )
           ( bindq *%b ( quotient *cor.b ( float *ct.b ) ) ) )
         ( T ( bindq *%b 0 ) ) )   \ avoid dividing by 0
  ( cond ( ( greater.than? *ct.c 0 )
           ( bindq *%c ( quotient *cor.c ( float *ct.c ) ) ) )
         ( T ( bindq *%c 0 ) ) )   \ avoid dividing by 0
  ( return ( list *%a *%b *%c ) ) )


(=====
CHECK.CLASSES
description: check to see which class is involved - return the class
MUST BE GENERALIZED TO OTHER CLASSES
)


c: CHECK.CLASSES
instance.of  flow.func
my.creator   jp2
i.take       symbol integer
i.give       symbol
arguments    *pro *focus
my.vars      *rslt
algorithm    ( do ( cond ( ( focus.in.class? 'ALPHA *pro *focus )   \ generalize &&&&
                           ( bindq *rslt 'A ) )
                         ( ( focus.in.class? 'BETA *pro *focus )    \ generalize &&&&
                           ( bindq *rslt 'B ) )
                         ( T ( bindq *rslt 'C ) ) )
                  ( return *rslt ) )


(=====
TEST.CLASSES
description: Uses tables, exemplars, and test proteins to construct lists of actual and
predicted classes which are used to generate test results.

Used for testing the training set to adjust weights, and for running test set
MUST BE GENERALIZED TO OTHER CLASSES
)

c: TEST.CLASSES
instance.of  flow.func
my.creator   jp2 wjb
i.take       list list list
i.give       list
arguments    *tables *exemplars *proteins
my.vars      *actual *predicted *pro *pro.data *pro.length *focus *window *results *exemp
algorithm    ( do
  ( display> "Testing Results .." print )
  ( loop.until ( null? *proteins )
    \ for each test protein
    ( do ( setq *ct.a 0 ) ( setq *ct.b 0 ) ( setq *ct.c 0 )
         ( setq *cor.a 0 ) ( setq *cor.b 0 ) ( setq *cor.c 0 )
         ( bindq *actual nil )   \ clear the lists
         ( bindq *predicted nil )
         ( bindq *pro ( first *proteins ) )
         ( display> "Next Protein:" print ) ( display *pro print )
         ( display> "Next Protein:" log ) ( display *pro log )
         ( bindq *pro.data ( value.of *pro 'MY.DATA ) )
         ( bindq *pro.length ( length *pro.data ) )
         ( display> "*pro.length: " print ) ( display *pro.length print )
         ( display " Window Radius: " print ) ( display *win.rad print )
         ( bindq *focus ( add1 *win.rad ) )
         ( loop.until ( greater.than? *focus ( minus *pro.length *win.rad ) )
           \ for each focus along sequence
           ( do ( bindq *actual ( cons ( check.classes *pro *focus ) *actual ) )
                ( display> "*focus:" print ) ( display *focus print )
                \ *actual grows as this loop runs
                \ ( display> "*actual:" print ) ( display *actual print )
                ( bindq *window ( get.window *win.rad *pro.data *focus ) )
                \ ( display> "*window:" print ) ( display *window print )
                ( bindq *exemp ( get.prediction *window *exemplars ) )
                ( bindq *predicted ( cons ( second *exemp ) *predicted ) )
                \ NOW: update globals to keep track of how we are doing
                ( inc.actual ( first *actual ) )
                ( if.true ( same? ( first *predicted ) ( first *actual ) )
                  ( inc.predicted ( first *predicted ) ) )
                \ *predicted grows as this loop runs
                \ ( display> "*predicted:" print ) ( display *predicted print )
                ( if.true ( equal? ( mod *focus 10 ) 0 )
                  ( do ( bindq *results ( collect.final.results ) )
                       ( display> " Intermediate Results:" print ) ( display *results print )
                       ( display> " Intermediate Results:" log ) ( display> *results log ) ) )
                ( bindq *focus ( add1 *focus ) ) ) )
         ( display> "*predicted:" print ) ( display *pro print )
         ( display> ( reverse *predicted ) print )
         ( display> " PROTEIN LOOKS LIKE THE FOLLOWING:" log )
         ( display> *pro log )
         ( display> " ACTUAL:" log )
         ( display> ( reverse *actual ) log )
         ( display> " PREDICTED:" log )
         ( display> ( reverse *predicted ) log )
         ( bindq *results ( collect.final.results ) )
         ( display> " Final Results:" print ) ( display *results print )
         ( display> " Final Results:" log ) ( display> *results log )
         ( bindq *results ( collect.results ( reverse *actual ) ( reverse *predicted ) ) )
         ( display> " List Comparison Results:" log )
         ( display> *results log )
         ( bindq *proteins ( rest *proteins ) ) ) )
  ( return *results ) )
\ ( collect.results ( reverse *actual ) ( reverse *predicted ) ) )

(=====
TRAIN.WEIGHTS
description: Used for testing the training set to adjust weights.

MUST BE GENERALIZED TO OTHER CLASSES
)


c: TRAIN.WEIGHTS
instance.of  flow.func
my.creator   jp2
i.take       list list list
i.give       list
arguments    *tables *exemplars *proteins
my.vars      *actual *predicted *pro *pro.data *pro.length *focus *window *exemp *exemps
algorithm    (do
  ( display> "Training Weights" print )
  ( bindq *exemps nil )   \ holds growing list of exemplars
  ( bindq *actual nil )
  ( bindq *predicted nil )
  ( loop.until ( null? *proteins )
    \ for each protein
    ( do ( bindq *pro ( first *proteins ) )
         ( display> "*pro:" print ) ( display *pro print )
         ( bindq *pro.data ( value.of *pro 'MY.DATA ) )
         ( bindq *pro.length ( length *pro.data ) )
         ( display> "*pro.length: " print ) ( display *pro.length print )
         ( display " Window Radius: " print ) ( display *win.rad print )
         ( bindq *focus ( add1 *win.rad ) )
         ( loop.until ( greater.than? *focus ( minus *pro.length *win.rad ) )
           \ for each focus along sequence
           ( do ( bindq *actual ( check.classes *pro *focus ) )
                ( bindq *window ( get.window *win.rad *pro.data *focus ) )
                ( bindq *exemp ( recall.prediction *window *exemplars ) )
                \ get nearest neighbor
                ( bindq *predicted ( second *exemp ) )
                ( if.true ( same? *actual *predicted )
                  ( bindq *exemp ( inc.correct *exemp ) ) )
                ( bindq *exemp ( inc.use *exemp ) )
                \ ( display> "TRAINED " print ) ( display *exemp print )
                ( bindq *exemps ( cons *exemp *exemps ) )
                ( bindq *focus ( add1 *focus ) ) ) )
         ( bindq *proteins ( rest *proteins ) ) ) )
  ( return *exemps ) )


Y 


THE  ALGORITHM 
c: RUN.NEAREST.NEIGHBOR
instance.of  flow.func
my.creator   jp2 wjb
i.take       number number number number symbol
i.give       list
arguments    'win.radius 'train 'test 'max.win.rad 'wt
my.vars      'results 'output 'temp
algorithm    (do
  ( display> "Is this repeated?" print )
  ( display> "Window Radius " log ) ( display> 'win.radius log )
  ( setq 'seed 2.0 )   \ not used &&&&
  ( setq 'win.rad 'win.radius )
  ( setq 'train# 'train )
  ( setq 'test# 'test )
  ( setq 'ct.a 0 ) ( setq 'ct.b 0 ) ( setq 'ct.c 0 )
  ( setq 'cor.a 0 ) ( setq 'cor.b 0 ) ( setq 'cor.c 0 )
  ( loop.until ( greater.than? 'win.rad 'max.win.rad )
    (do ( construct 'win.rad )   \ Sets globals 'test.list 'rest.list and 'exemplar.list
        ( display> "train.list" log ) ( display> 'train.list log )
        ( display> "#training proteins: " log ) ( display ( length 'train.list ) log )
        ( display> "use weights: " log ) ( display> 'wt log )
        ( display> "#Exemplars = " log ) ( display ( length 'exemplar.list ) log )
        ( bindq 'output ( count.classes 'exemplar.list ) )
        ( display> " Exemplars: Tot #alpha #beta #coil %alpha %beta %coil:" log )
        ( display> 'output log )
        ( display> "test.list" log ) ( display> 'test.list log )
        \ first, we build the feature value difference tables
        ( setq 'table.list ( create.tables 'win.rad ) )
        \ TRAIN
        \ now, we should test them on the training set and set weights
        ( if.true ( same? 'wt T )
          ( setq 'exemplar.list
            ( train.weights 'table.list 'exemplar.list 'train.list ) ) )
        ( display> "#Exemplars after weight training = " log ) ( display ( length 'exemplar.list ) log )
        \ now, we test them on a test set
        ( bindq 'results ( test.classes 'table.list 'exemplar.list 'test.list ) )
        \ ( display> "%alpha predicted by nearest neighbor: " print ) ( display ( first 'results ) print )
        \ ( display> "%alpha predicted by nearest neighbor: " log ) ( display ( first 'results ) log )
        \ ( display> "%beta predicted by nearest neighbor: " print ) ( display ( second 'results ) print )
        \ ( display> "%beta predicted by nearest neighbor: " log ) ( display ( second 'results ) log )
        \ ( display> "%coil predicted by nearest neighbor: " print ) ( display ( third 'results ) print )
        \ ( display> "%coil predicted by nearest neighbor: " log ) ( display ( third 'results ) log )
        \ ( display> "%overall predicted by nearest neighbor: " print ) ( display ( fourth 'results ) print )
        \ ( display> "%overall predicted by nearest neighbor: " log ) ( display ( fourth 'results ) log )
        ( setq 'win.rad ( plus 'win.rad 2 ) )   \ inc window radius by 2
        ( setq 'seed 2.0 ) ) )   \ not used &&&&
  \ ( display> "FINAL RESULTS: " print ) ( display ( collect.final.results ) print )
  \ ( display> "Final Results:" log ) ( display> ( collect.final.results ) log )
  ( display> "ALL DONE... " print ) )
(
ROUGH.SET.SAX

description: basic Rough Sets modified by Scott King
based on RX-4.TST.T

need to do: rank ordering of attributes
generalize to multiple slot data moving
trim universe length to shortest data list length

to improve: x

PROBLEMS: x

notes: x

CHANGES:
2/09/93 first cut NewCentury 12point, TabStops=4, LineWrap=100.
2/10/93 jp2: running on car data; general frames for Browning study; first Browning run done
2/11/93 jp2: adding rank ordering cols
)


ALMOST.EQUAL?

description: checks to see if two numbers are within some distance of each other
example input: 2.5 2.6 .15
example output: T
notes:

ALMOST.EQUAL?
sub.of     predicate
i.take     number number number
i.give     flag
arguments  'num1 'num2 'error
algorithm
( between? 'num2 ( minus 'num1 'error ) ( plus 'num1 'error ) )
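
The tolerance test in ALMOST.EQUAL? can be sketched in Python as follows. This is an illustrative sketch only: the name `almost_equal` is hypothetical, and it assumes `between?` treats both end points as inclusive, which the listing does not state.

```python
def almost_equal(num1: float, num2: float, error: float) -> bool:
    """Return True when num2 lies within `error` of num1.

    Assumption: both bounds are inclusive; the listing's `between?`
    may treat the end points differently.
    """
    return (num1 - error) <= num2 <= (num1 + error)


if __name__ == "__main__":
    # Values in the spirit of the listing's example input.
    print(almost_equal(2.5, 2.6, 0.15))
```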

CALC.BETA

description: check degree of intersect of A in B
example input: ( 1 3 4 ) ( 1 3 4 6 7 )
example output: 0.6
notes: x

c: CALC.BETA
instance.of  flow.func
i.take       list list
i.give       number
arguments    'set1 'set2
my.vars      'count 'length
algorithm    ( do ( bindq 'length ( length 'set2 ) )
  ( bindq 'count 0 )
  ( loop.until ( null? 'set1 )   \ for every set1 member
    ( do ( if.true ( member? ( first 'set1 ) 'set2 )
           ( bindq 'count ( plus1 'count ) ) )
         ( bindq 'set1 ( rest 'set1 ) ) ) )
  ( return ( quotient 'count 'length ) ) )
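
A minimal Python sketch of the overlap measure computed by CALC.BETA: count the members of the first list that also appear in the second, then divide by the length of the second list. The function name `calc_beta` is an illustrative choice, and the sketch assumes the second list is non-empty.

```python
def calc_beta(set1, set2):
    """Fraction obtained by dividing |set1 intersect set2| by len(set2).

    Mirrors the listing: the divisor is the length of the second list.
    Assumes set2 is non-empty.
    """
    count = sum(1 for member in set1 if member in set2)
    return count / len(set2)


if __name__ == "__main__":
    # The example input from the listing's header comment.
    print(calc_beta([1, 3, 4], [1, 3, 4, 6, 7]))
```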


( SORT.COLS.ON.WORTH

description: sort to descending worth - "bubble sort"
example input: list of concepts
example output: sorted list of concepts
notes: made from SORT.ON.WORTH
)

c: GET.COL.WORTH
instance.of  flow.func
i.take       number symbol
i.give       number
arguments    'col 'array
algorithm
( value.of ( array2@ 'array 0 'col ) 'WORTH )

c: SORT.COLS.ON.WORTH   \ seems to work
instance.of  flow.func
i.take       list symbol
i.give       list
arguments    'list 'array
my.vars      'templist 'moved
algorithm    (
do ( bindq 'templist nil )
   ( bindq 'moved T )
   ( loop.until ( not? 'moved )
     ( do ( bindq 'moved F )
          ( loop.until ( null? 'list )
            ( do \ ( display> 'list debug ) ( display> " ### " debug ) ( display 'templist debug )
              ( cond ( ( equal? ( length 'list ) 1 )
                       ( do ( bindq 'templist ( cons ( first 'list ) 'templist ) )
                            ( bindq 'list ( rest 'list ) ) ) )
                     ( ( greater.than?
                         ( get.col.worth ( second 'list ) 'array )
                         ( get.col.worth ( first 'list ) 'array ) )
                       ( do ( bindq 'templist ( cons ( second 'list ) 'templist ) )
                            ( bindq 'list ( cons ( first 'list ) ( rest ( rest 'list ) ) ) )
                            ( bindq 'moved T ) ) )
                     ( T ( do ( bindq 'templist ( cons ( first 'list ) 'templist ) )
                              ( bindq 'list ( rest 'list ) ) ) ) ) ) )
          ( bindq 'list ( reverse 'templist ) )
          ( bindq 'templist nil ) ) )
   ( return 'list )
   \ ( display> "sort.on.worth = " debug ) ( display 'list debug )
)
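
The descending bubble sort used by SORT.COLS.ON.WORTH can be sketched in Python as below. The sketch is an assumption-laden simplification: the `worth` dictionary stands in for the WORTH slot that the listing reads through GET.COL.WORTH, and the names are hypothetical.

```python
def sort_cols_on_worth(cols, worth):
    """Return cols ordered by descending worth using a bubble sort.

    `worth` maps a column to its numeric worth; it is a stand-in for the
    WORTH slot lookup in the listing.
    """
    ordered = list(cols)
    moved = True
    while moved:                      # repeat passes until nothing moves
        moved = False
        for i in range(len(ordered) - 1):
            # Swap neighbours when the later column is worth more.
            if worth[ordered[i + 1]] > worth[ordered[i]]:
                ordered[i], ordered[i + 1] = ordered[i + 1], ordered[i]
                moved = True
    return ordered


if __name__ == "__main__":
    print(sort_cols_on_worth([1, 2, 3], {1: 0.1, 2: 0.5, 3: 0.3}))
```

Columns of equal worth keep their relative order, since a swap happens only on a strict "greater than" comparison.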


•COUNTING 


COMMENT:

ALL ARRAY VALUES ARE SYMBOLIC
ALL ARRAY X&Y VALUES START AT 00
ROW 00 is reserved for NAMES
COMMENT;
( COMPARE.ROWS?

description: see if current row (for each P col) is same as given row
example input: x
example output: x
notes: THIS SHOULD ELIMINATE ANY ROWS
WHICH ALREADY COMPARE -- to save some compute time
eliminate by simply putting the row number on a list, then check
the list before comparing all the columns in a row
)

c: COMPARE.ROWS?
instance.of  flow.pred
i.take       number list symbol list
i.give       flag
arguments    'row 'cols 'array 'given   \ given is a given row of P data
my.vars      'truth
algorithm    ( do ( bindq 'truth T )
  ( loop.until ( or? ( null? 'cols )
                     \ for every P col
                     ( not? 'truth ) )
                     \ or mismatch found
    ( do ( bindq 'truth ( same? ( array2@ 'array 'row ( first 'cols ) )
                                ( first 'given ) ) )
         ( bindq 'given ( rest 'given ) )
         ( bindq 'cols ( rest 'cols ) ) ) )
  ( return 'truth ) )


\ ====================================

( COUNT.X.PRIME

description: collect partition of rows which look the same based on given set
example input: x
example output: list of lists of rows in array which look the same
notes: x
)

c: COUNT.X.PRIME
instance.of  math.func
i.take       list number symbol
i.give       list   \ list of lists
arguments    'cols 'max.y 'array
my.vars      'row 'checked 'result 'temp.v 'holder 'temp.s 'xx
algorithm
( do \ ( display> "Checking X' on " print ) ( display 'cols print ) ( display 'max.y print )
  ( bindq 'result nil )    \ holds list of subsets
  ( bindq 'checked nil )   \ holds N values that have been counted
  ( bindq 'row 'max.y )
  ( loop.until ( equal? 'row 0 )   \ for every row except 0
    ( do \ take specified contents of row
      ( bindq 'temp.v nil )
      ( bindq 'xx 'cols )
      ( loop.until ( null? 'xx )
        \ for every P col
        \ for this row, make a list of P col entries
        ( do ( bindq 'temp.s ( array2@ 'array 'row ( first 'xx ) ) )
             ( bindq 'temp.v ( cons 'temp.s 'temp.v ) )
             ( bindq 'xx ( rest 'xx ) ) ) )
      \ now have a list (in reverse order) of a given row
      \ see if it has already been checked
      ( bindq 'temp.v ( reverse 'temp.v ) )
      ( bindq 'temp.s nil )   \ holds subset of rows
      ( if.true ( not? ( member? 'temp.v 'checked ) )
        ( do \ not checked yet, keep track of it
          ( bindq 'checked ( cons 'temp.v 'checked ) )
          ( bindq 'holder 'row )
          ( loop.until ( equal? 'holder 0 )
            \ for every row above this row, stopping row 0
            ( do \ gather this row
              ( if.true
                ( compare.rows?
                  'holder 'cols 'array 'temp.v )
                ( bindq 'temp.s
                  ( cons 'holder 'temp.s ) ) )
              ( bindq 'holder ( sub1 'holder ) ) ) ) ) )
      ( if.true ( notnull? 'temp.s )
        ( bindq 'result ( cons ( reverse 'temp.s ) 'result ) ) )
      ( bindq 'row ( sub1 'row ) ) ) )
  ( return ( reverse 'result ) ) )
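
The partitioning step performed by COUNT.X.PRIME (grouping rows that carry identical values in a chosen set of columns) can be sketched in Python as follows. This is a simplified illustration, not the listing's method: it uses a dictionary keyed on the selected column values instead of the nested row comparison, the table is a hypothetical list of rows with row 0 reserved for names, and the groups come out in first-seen order, not the order the listing produces.

```python
def count_x_prime(cols, table):
    """Group data-row indices that agree on every column in `cols`.

    table[0] is the reserved name row, so data rows start at index 1.
    Returns a list of lists of row indices (an equivalence partition).
    """
    groups = {}
    for row in range(1, len(table)):
        key = tuple(table[row][c] for c in cols)   # this row's signature
        groups.setdefault(key, []).append(row)
    return list(groups.values())


if __name__ == "__main__":
    table = [
        ["a", "b", "d"],      # row 0: column names
        ["x", "1", "y"],
        ["x", "2", "n"],
        ["z", "1", "y"],
        ["x", "1", "n"],
    ]
    print(count_x_prime([0], table))
    print(count_x_prime([0, 1], table))
```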


( IND.P.PRIME

description: collect partition of rows which look the same based on P (given)
example input: x
example output: list of lists of rows in array which look the same
notes: P is a list of columns
)

c: IND.P.PRIME
instance.of  math.func
i.take       list number symbol
i.give       list   \ list of lists
arguments    'pset 'max.y 'array
algorithm    ( count.x.prime 'pset 'max.y 'array )


( N.PRIME

description: collect partition based on N (given)
example input: list of cols called N, max # rows in array, array
example output: list of lists of rows in array which have same N value
notes: find all members of N which have the same value
)

c: N.PRIME
instance.of  math.func
i.take       list number symbol
i.give       list   \ list of lists
arguments    'cols 'max.y 'array
algorithm    ( count.x.prime 'cols 'max.y 'array )
( LOWER.IND.P

description: collect partition based on subset of N in ind.p.prime
example input: x
example output: list of rows in array
notes: RESULT IS NO LONGER ORDERED
)

c: LOWER.IND.P
instance.of  math.func
i.take       list list number   \ N.PRIME subset, IND.P.PRIME list
i.give       list   \ list of rows
arguments    'n.primei 'ind.prime 'beta
my.vars      'result
algorithm    ( do ( bindq 'result nil )
  ( loop.until ( null? 'ind.prime )   \ for every n.prime
    ( do ( if.true ( greater.than?
                     ( calc.beta ( first 'ind.prime ) 'n.primei )
                     'beta )
           ( bindq 'result ( concat ( copy.list ( first 'ind.prime ) )
                                    'result ) ) )
         ( bindq 'ind.prime ( rest 'ind.prime ) ) ) )
  \ ( display> "LOWER.IND(P)" log )
  \ ( display> 'n.primei log ) ( display> 'result log )
  ( return 'result ) )
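
A hedged Python sketch of the thresholded lower approximation in LOWER.IND.P: every P-class whose overlap with the given N-class exceeds `beta` contributes its rows to the result. The function names are illustrative, and `calc_beta` follows the listing in dividing by the length of its second argument.

```python
def calc_beta(set1, set2):
    """Overlap of set1 with set2, divided by len(set2), as in CALC.BETA."""
    return sum(1 for member in set1 if member in set2) / len(set2)


def lower_ind_p(n_class, ind_prime, beta):
    """Collect rows of every P-class whose overlap with n_class exceeds beta.

    As the listing notes, the result is not kept in any particular order.
    """
    result = []
    for p_class in ind_prime:
        if calc_beta(p_class, n_class) > beta:
            result = list(p_class) + result   # prepend, like the listing's concat
    return result


if __name__ == "__main__":
    print(lower_ind_p([1, 2, 3, 4], [[1, 2, 3], [4, 5], [6]], 0.5))
```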


\ ====================================

( POS.P.N

description: x
example input: x
example output:
notes: RESULT IS NO LONGER ORDERED
)

c: POS.P.N
instance.of  math.func
i.take       list list number   \ N.PRIME list, IND.P.PRIME list
i.give       list   \ union
arguments    'n.prime 'ind.prime 'beta
my.vars      'result
algorithm    ( do ( bindq 'result nil )
  ( loop.until ( null? 'n.prime )
    ( do ( bindq 'result
           ( list.union ( lower.ind.p ( first 'n.prime ) 'ind.prime 'beta )
                        'result ) )
         ( bindq 'n.prime ( rest 'n.prime ) ) ) )
  \ ( display> "POS(P,N)" log ) ( display> 'result log )
  ( return 'result ) )


\ ====================================

( K.P.N

description: dependency of N on P
example input: x
example output: x
notes: x
)

: %x/y ( sym sym -- sym )
integer> float
integer> float fswap
f/ pfloating ;

c: FQUOTIENT
instance.of  math.func
i.take       number number
i.give       number
forth        %x/y


c: K.P.N
instance.of  math.func
i.take       list list number number
i.give       number
arguments    'n.prime 'ind.prime 'universe 'beta
my.vars      'result
algorithm    ( do ( bindq 'result 0.0 )
  ( bindq 'result ( fquotient ( length ( pos.p.n 'n.prime 'ind.prime 'beta ) )
                              'universe ) )
  ( return 'result ) )
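
The dependency measure k(P,N) computed by K.P.N is the size of the positive region divided by the size of the universe. A self-contained Python sketch under the same assumptions as the listing (overlap divided by the length of the N-class, strict "greater than" test against `beta`) might look like this; all function names are hypothetical.

```python
def calc_beta(set1, set2):
    """Overlap of set1 with set2, divided by len(set2)."""
    return sum(1 for member in set1 if member in set2) / len(set2)


def pos_p_n(n_prime, ind_prime, beta):
    """Union, over all N-classes, of the P-classes that pass the beta test."""
    positive = set()
    for n_class in n_prime:
        for p_class in ind_prime:
            if calc_beta(p_class, n_class) > beta:
                positive.update(p_class)
    return positive


def k_p_n(n_prime, ind_prime, universe, beta):
    """Dependency of N on P: |POS(P,N)| / universe (number of rows)."""
    return len(pos_p_n(n_prime, ind_prime, beta)) / universe


if __name__ == "__main__":
    n_prime = [[1, 2, 3], [4, 5, 6]]
    ind_prime = [[1, 2], [3, 4], [5, 6]]
    print(k_p_n(n_prime, ind_prime, 6, 0.5))
```

In this toy partition, rows 3 and 4 fall in a P-class that straddles both N-classes, so they are left out of the positive region and the dependency is below 1.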


\ ====================================

( DO.ROUGH.SETS

description: the big routine
example input: x
example output: x
notes: x
)

c: DO.ROUGH.SETS
instance.of  flow.func
i.take       list list number symbol number
i.give       number   \ k(p,n)
arguments    'pset 'nset 'universe 'beta
my.vars      'n.prime 'ind.prime 'kpn
algorithm    ( do \ ( display> "DOING ROUGH SETS" print )
  \ ( display> "############ DOING ROUGH SETS" log )
  \ ( display> 'pset log )
  ( bindq 'n.prime ( n.prime 'nset 'universe 'array ) )
  \ ( display> "N'= " print ) ( display 'n.prime print )
  \ ( display> "N'= " log ) ( display> 'n.prime log )
  ( bindq 'ind.prime ( ind.p.prime 'pset 'universe 'array ) )
  \ ( display> "IND(P)'= " print ) ( display 'ind.prime print )
  \ ( display> "IND(P)'= " log ) ( display> 'ind.prime log )
  ( bindq 'kpn ( k.p.n 'n.prime 'ind.prime 'universe 'beta ) )
  ( display> "K(PN)= " print ) ( display 'kpn print )
  ( display> "K(PN)= " log ) ( display 'kpn log )
  ( return 'kpn )
)
MAKE.PLIST
sub.of      function
my.creator  sak
i.take      number
i.give      list
arguments   'num
my.vars     'list 'count
algorithm
( do ( bindq 'list nil )
     ( bindq 'count 1 )
     ( loop.until ( greater.than? 'count 'num )
       ( do ( bindq 'list ( cons 'count 'list ) )
            ( bindq 'count ( add1 'count ) ) ) )
     ( return ( reverse 'list ) ) )


c: FIND.MINSET
instance.of  flow.func
i.take       list list list number number number
i.give       list
arguments    'cop 'pset 'nset 'universe 'curkpn 'beta
my.vars      'psetp 'newkpn 'sig
algorithm
( do \ ( display> "FIND.MIN.SET" print )
  ( loop.until ( null? 'cop )
    ( do ( display> "Deleted member " print )
         ( display ( value.of ( first 'cop ) 'MY.COL ) print )
         ( bindq 'psetp ( delete ( value.of ( first 'cop ) 'MY.COL ) ( copy.list 'pset ) ) )
         ( display> "PSET " log ) ( display> 'pset log )
         ( bindq 'newkpn ( do.rough.sets 'psetp 'nset 'universe 'beta ) )
         ( bindq 'sig ( quotient ( float ( minus 'curkpn 'newkpn ) ) 'curkpn ) )
         ( if.true ( almost.equal? 'sig 0 .005 )
           ( do ( display> " Toss that one away " print )
                ( bindq 'pset ( delete ( value.of ( first 'cop ) 'MY.COL ) 'pset ) ) ) )
         \ ( display> "PSET " print ) ( display 'pset print )
         ( array2! 'array 0 ( value.of ( first 'cop ) 'MY.COL ) 'sig )
         ( bindq 'cop ( rest 'cop ) ) ) )
  ( return 'pset ) )
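
The pruning loop in FIND.MINSET can be sketched in Python as below. Each column is removed on trial; if the relative change in dependency (the significance) is within 0.005 of zero, the column is dropped for good. The sketch is illustrative only: `dependency` is a hypothetical callable standing in for DO.ROUGH.SETS, and `curkpn` is assumed to be non-zero, as the caller in the listing checks before invoking the routine.

```python
def find_minset(pset, dependency, curkpn, tolerance=0.005):
    """Drop columns whose removal barely changes the dependency k(P,N).

    pset        list of candidate columns
    dependency  callable taking a column list and returning k(P,N)
    curkpn      dependency of the full set (assumed non-zero)
    Returns the pruned column list and each column's significance.
    """
    kept = list(pset)
    significance = {}
    for col in pset:
        trial = [c for c in kept if c != col]       # current set without col
        sig = (curkpn - dependency(trial)) / curkpn
        significance[col] = sig
        if abs(sig) <= tolerance:                   # negligible effect: toss it
            kept = trial
    return kept, significance


if __name__ == "__main__":
    def toy_dependency(cols):
        # Hypothetical stand-in: 'a' and 'b' matter, 'c' does not.
        if "a" in cols and "b" in cols:
            return 1.0
        return 0.5 if "a" in cols else 0.0

    print(find_minset(["a", "b", "c"], toy_dependency, 1.0))
```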


c: FIND.SIG.VALS
instance.of  flow.func
i.take       list list number symbol number
i.give       list   \ k(pn)
arguments    'pset 'nset 'universe 'array 'curkpn
my.vars      'count 'test 'newkpn 'sig
algorithm
( do ( display> "FIND.SIG.VALS" print )
  ( bindq 'count 'pset )
  ( bindq 'test 'pset )
  ( loop.until ( null? 'count )
    ( do ( display> "Deleted member" print )
         ( display ( first 'count ) print )
         ( bindq 'test ( delete ( first 'count ) ( copy.list 'pset ) ) )
         ( display 'test print )
         ( bindq 'newkpn ( do.rough.sets 'test 'nset 'universe 'array ) )
         ( bindq 'sig ( quotient ( float ( minus 'curkpn 'newkpn ) ) 'curkpn ) )
         ( array2! 'array 0 ( first 'count ) 'sig )
         ( bindq 'count ( rest 'count ) ) ) )
  ( return 'pset ) )


c: SORT&EVAL.RULES
instance.of  flow.func
i.take       list list number symbol
i.give       list   \ k(pn)
arguments    'rules 'nset 'universe 'array
my.vars      'cop 'pset 'curkpn 'psetp 'newkpn 'sig
algorithm    ( do ( display> "SORT&EVAL.RULES" print )
  ( bindq 'cop 'rules )
  ( display 'cop print )
  ( bindq 'pset ( make.plist ( length 'rules ) ) )
  \ ( display> "pset, nset, universe, array" print )
  \ ( display 'pset print ) ( display 'nset print ) ( display 'universe print ) ( display 'array print )
  ( bindq 'curkpn ( do.rough.sets 'pset 'nset 'universe 'array ) )
  ( if.true ( not? ( equal? 'curkpn 0 ) )
    ( do ( bindq 'pset ( find.minset 'cop 'pset 'nset 'universe 'array 'curkpn ) )
         \ ( bindq 'pset ( find.sig.vals 'pset 'nset 'universe 'array 'curkpn ) )
         ( display> "MINIMAL SET " print ) ( display 'pset print ) ( display 'curkpn print )
         ( display> "MINIMAL SET " log ) ( display> 'pset log )
         ( display> "Has K.P.N. of " log ) ( display> 'curkpn log ) ) )
  ( return 'pset ) )


COMMENT: 

BIG  CODE  THAT  NEEDS  TO  BE  BROKEN  DOWN 


( FIND.CORE.SET

description: an even bigger routine
example input: x
example output: x
notes: x
)

c: FIND.CORE.SET
instance.of  flow.func
i.take       list list number symbol
i.give       none   \ k(pn)
arguments    'pset 'nset 'universe 'array
my.vars      'curkpn 'core 'xpset 'plen 'xkpn 'del 'len 'core 'minset
algorithm    ( do ( display> "FINDING CORE SET" print )
  \ pset and nset are the entire universe on first pass
  ( bindq 'curkpn ( do.rough.sets 'pset 'nset 'universe 'array ) )
  \ now, one at a time, pick vbls out of pset
  ( bindq 'core nil )
  ( bindq 'plen 0 )
  ( bindq 'len ( length 'pset ) )
  ( loop.until ( equal? 'plen 'len )
    ( do ( bindq 'xpset ( copy.list 'pset ) )
         ( bindq 'del ( nth 'pset 'plen ) )
         \ ( display> "ABOUT TO DELETE " print ) ( display 'del print )
         \ ( display> "from " print ) ( display 'xpset print )
         ( bindq 'xkpn ( do.rough.sets
                         ( delete 'del 'xpset ) 'nset
                         'universe 'array ) )
         \ now, tell about this column's worth
         ( display> "COL= " log ) ( display ( array2@ 'array 0 'del ) log )
         ( display ( minus 'curkpn 'xkpn ) log )
         \ now save this difference in the frame of the col
         \ bigger is better!
         ( set.value ( array2@ 'array 0 'del ) 'WORTH
                     ( minus 'curkpn 'xkpn ) )
         \ if the col is needed to maintain 'curkpn, it's part of "core" set
         ( if.true ( less.than? 'xkpn 'curkpn )
           ( bindq 'core ( cons 'del 'core ) ) )
         ( display> "CORE= " print ) ( display 'plen print ) ( display 'core print )
         ( display 'pset print )
         ( bindq 'plen ( add1 'plen ) ) ) )
  ( display> "SORTING CORE SET: " print )
  ( bindq 'core ( sort.cols.on.worth 'core 'array ) )
  ( display 'core print )
  ( display> "CORE SET= " log ) ( display> 'core log )
  \ now, test core set
  \ core set is intersect of all minimum sets
  ( bindq 'minset 'pset )   \ the minimum default set
  ( if.true ( notnull? 'core )
    ( do ( display> "NOW TESTING CORE SET" log )
         ( bindq 'xkpn ( do.rough.sets 'core 'nset 'universe 'array ) )
         ( display 'xkpn print )
         ( display> 'xkpn log )
         \ now off to find minimum sets
         \ but..
         \ if the core set gets a full k(pn), then the core set is the minset
         ( cond ( ( less.than? 'xkpn 'curkpn )
                  ( bindq 'minset
                    ( find.minsets 'pset 'nset 'universe 'array 'core 'curkpn ) ) )
                ( T ( bindq 'minset 'core ) ) ) ) )
  ( display> "MIN SETS= " print ) ( display 'minset print )
  ( display> "MIN SETS= " log ) ( display> 'minset log )
)
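
The core-set test in FIND.CORE.SET (a column belongs to the core when deleting it lowers the dependency) can be sketched in Python as follows. The sketch is a simplification under stated assumptions: `dependency` is a hypothetical callable standing in for DO.ROUGH.SETS, the worth of each column is kept in a dictionary instead of the column's WORTH slot, and Python's built-in sort replaces the listing's bubble sort.

```python
def find_core(pset, dependency):
    """Return the core columns (sorted by descending worth) and all worths.

    A column's worth is the drop in dependency when it alone is removed;
    a column is in the core when that drop is positive.
    """
    curkpn = dependency(pset)
    worth = {}
    core = []
    for col in pset:
        xkpn = dependency([c for c in pset if c != col])
        worth[col] = curkpn - xkpn          # bigger is better
        if xkpn < curkpn:                   # needed to maintain curkpn
            core.append(col)
    core.sort(key=lambda c: worth[c], reverse=True)
    return core, worth


if __name__ == "__main__":
    def toy_dependency(cols):
        # Hypothetical stand-in: 'a' and 'b' matter, 'c' does not.
        if "a" in cols and "b" in cols:
            return 1.0
        return 0.5 if "a" in cols else 0.0

    print(find_core(["c", "b", "a"], toy_dependency))
```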

COMMENT: 
{
EVOLVE.T

build a population of protein evaluation rules using the GA approach.
}

{ CHOOSE.SET.BY.WORTH

description: Given the current population, randomly choose a member, weighted by the current worth
value of every member of the population
example input: ( con_101 con_202 con_303 ... )
example output: con_4242
notes: uses rnd.float instead of rnd.mod to avoid problems caused by calling rnd.mod with large arguments
}

c: CHOOSE.SET.BY.WORTH
sub.of     function
i.take     list
i.give     symbol
arguments  'population
my.vars    'my.pop 'total.val 'num 'set 'val 'this.worth
algorithm  (do
  ( bindq 'my.pop 'population )
  ( bindq 'total.val 0 )
  ( bindq 'set nil )
  ( loop.until ( null? 'my.pop )   \@@@@ get total worth of population
    (do
      ( bindq 'total.val ( plus 'total.val ( value.of ( first 'my.pop ) 'WORTH ) ) )
      ( bindq 'my.pop ( rest 'my.pop ) ) ) )
  ( bindq 'num ( int ( times ( rnd.float ) 'total.val ) ) )
  \@@@@ choose random value
  ( bindq 'val 0 )
  ( if.true ( equal? 'total.val 0 )   \@@@@ if no sets with worth, grab one at random
    ( bindq 'set ( member 'population ) ) )
  ( loop.until ( notnull? 'set )   \@@@@ wait till got one
    (do
      ( bindq 'this.worth ( value.of ( first 'population ) 'WORTH ) )
      ( bindq 'val ( plus 'val 'this.worth ) )
      \@@@@ add worths till over random val
      ( if.true ( greater.than? 'val 'num )
        ( bindq 'set ( first 'population ) ) )
      ( bindq 'population ( rest 'population ) ) ) )
  ( return 'set ) )
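
CHOOSE.SET.BY.WORTH is a worth-proportional ("roulette wheel") selection. A minimal Python sketch follows, assuming worths are non-negative numbers held in a dictionary (the listing reads them from each member's WORTH slot) and assuming the random source behaves like Python's `random` module; the `rng` parameter is a hypothetical hook added so the choice can be made repeatable.

```python
import random


def choose_by_worth(population, worth, rng=random):
    """Pick a member with probability roughly proportional to its worth.

    Falls back to a uniform pick when the total worth is zero, as the
    listing does.
    """
    total = sum(worth[member] for member in population)
    if total == 0:
        return rng.choice(population)
    threshold = int(rng.random() * total)   # random value below the total
    running = 0
    for member in population:
        running += worth[member]            # add worths till over the threshold
        if running > threshold:
            return member
    return population[-1]                   # guard against rounding effects


if __name__ == "__main__":
    print(choose_by_worth(["con_101", "con_202", "con_303"],
                          {"con_101": 1, "con_202": 2, "con_303": 7}))
```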


( CROSS.RULES

description: Given a rule list, randomly choose two different rules from it, and switch a portion of their
RES slots, returning the new rules stuck into the old rule list
example input: ( con_1 con_2 con_3 ... )
example output: ( con_1 con_3 con_4 ... )
notes: does not change the then.predict slot - maybe it could/should? NO
)

c: CROSS.RULES
sub.of     function
i.take     list
i.give     list
arguments  'rule.list
my.vars    'father 'mother 'size 'cross.pt 'holder 'where 'slot 'son 'daughter 'position
algorithm  (do
  ( bindq 'father ( member 'rule.list ) )
  ( bindq 'mother ( member 'rule.list ) )
  ( loop.until ( not? ( same? 'father 'mother ) )
    \@@@@ get two different rules
    ( bindq 'mother ( member 'rule.list ) ) )
  ( delete 'mother 'rule.list )
  ( delete 'father 'rule.list )
  \ ( bindq 'rule.list ( removeall ( list 'mother 'father ) 'rule.list ) )
  ( bindq 'son ( new.rule ) )
  ( bindq 'daughter ( new.rule ) )
  ( bindq 'size ( value.of ( @cur.concept ) 'WINDOW.SIZE ) )
  ( bindq 'cross.pt ( add1 ( rnd.mod ( sub1 'size ) ) ) )
  ( bindq 'where ( add1 'cross.pt ) )
  ( bindq 'position 1 )
  ( loop.until ( equal? 'position 'where )
    \@@@@ copy up to switch point
    (do
      ( bindq 'slot ( change.to.symbol 'position ) )
      ( set.value 'son 'slot ( value.of 'father 'slot ) )
      ( set.value 'daughter 'slot ( value.of 'mother 'slot ) )
      ( bindq 'position ( add1 'position ) ) ) )
  ( loop.until ( greater.than? 'position 'size )
    \@@@@ switch after that
    (do
      ( bindq 'slot ( change.to.symbol 'position ) )
      ( set.value 'son 'slot ( value.of 'mother 'slot ) )
      ( set.value 'daughter 'slot ( value.of 'father 'slot ) )
      ( bindq 'position ( add1 'position ) ) ) )
  ( set.value 'son 'THEN.PREDICT ( value.of 'father 'THEN.PREDICT ) )
  ( set.value 'daughter 'THEN.PREDICT ( value.of 'mother 'THEN.PREDICT ) )
  ( bindq 'rule.list ( concat ( list 'son 'daughter ) 'rule.list ) )
  ( return 'rule.list ) )
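
The slot exchange in CROSS.RULES is a single-point crossover over the window slots. The Python sketch below models each rule as a plain list of slot values, which is a simplification of the listing's frames, and leaves out the THEN.PREDICT slot that each child inherits unchanged. It assumes `rnd.mod n` yields an integer from 0 to n-1, which the listing does not confirm, and requires rules of at least two slots.

```python
import random


def cross_rules(father, mother, rng=random):
    """Single-point crossover of two equal-length slot lists.

    The son takes the father's slots up to the cut and the mother's after
    it; the daughter is the mirror image.
    """
    size = len(father)
    cut = 1 + rng.randrange(size - 1)       # cut point in 1..size-1
    son = father[:cut] + mother[cut:]
    daughter = mother[:cut] + father[cut:]
    return son, daughter


if __name__ == "__main__":
    son, daughter = cross_rules(["F1", "F2", "F3", "F4"],
                                ["M1", "M2", "M3", "M4"])
    print(son, daughter)
```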


( MUTATE.ONE.RULE

description: Given a rule list, randomly choose one rule, mutate it and return the new list
example input: ( con_1 con_2 con_3 ... )
example output: ( con_1 con_2 con_4 ... )
notes:
)

c: MUTATE.ONE.RULE
sub.of     function
i.take     list
i.give     list
arguments  'rule.list
my.vars    'mutatee 'slot 'amino 'aa.list 'num 'new.rule 'times
algorithm  (do
  ( bindq 'mutatee ( member 'rule.list ) )
  ( delete 'mutatee 'rule.list )
  ( bindq 'new.rule ( new.rule ) )
  ( copy.p.list 'mutatee 'new.rule )
  ( bindq 'times ( add1 ( rnd.mod ( quotient ( value.of ( @cur.concept ) 'WINDOW.SIZE ) 2 ) ) ) )
  ( loop.until ( equal? 'times 0 )
    (do
      ( bindq 'slot ( rnd.mod ( add1 ( value.of ( @cur.concept ) 'WINDOW.SIZE ) ) ) )
      (cond
        ( ( equal? 'slot 0 )
          ( set.value 'new.rule 'THEN.PREDICT ( create.prediction ) ) )
        (T
          (do
            ( bindq 'slot ( change.to.symbol 'slot ) )
            ( bindq 'aa.list nil )
            ( bindq 'num ( add1 ( rnd.mod 5 ) ) )
            ( loop.until ( equal? 'num 0 )
              (do
                ( bindq 'aa.list ( eunion ( get.amino ) 'aa.list ) )
                ( bindq 'num ( sub1 'num ) ) ) )
            ( rnd.do
              ( set.value 'new.rule 'slot 'aa.list )
              ( set.value 'new.rule 'slot ( cons 'X 'aa.list ) ) ) ) ) )
      ( bindq 'times ( sub1 'times ) ) ) )
  ( bindq 'rule.list ( cons 'new.rule 'rule.list ) )
  ( return 'rule.list ) )


{ CROSS.PAIR.SETS

description: Given the current population of rule sets, choose two sets and cross-breed them, switching rules around
example input: ( con_101 con_202 con_303 ... )
example output: ( con_101 con_303 )
notes: tries to find sets that have different talents
}

c: CROSS.PAIR.SETS
sub.of     function
i.take     list
i.give     list
arguments  'curr.pop
my.vars    'father 'mother 'father.loc 'mother.loc 'son 'daughter 'father.rules 'mother.rules 'son.rules
my.vars    'daughter.rules 'this.rule 'tries
algorithm  ( do
  ( bindq 'father ( choose.set.by.worth 'curr.pop ) )
  ( bindq 'mother ( choose.set.by.worth 'curr.pop ) )
  ( bindq 'tries 0 )
  ( loop.until ( or? ( not? ( same? ( value.of 'father 'BEST ) ( value.of 'mother 'BEST ) ) )
                     ( greater.than? 'tries 3 ) )
    (do
      ( bindq 'mother ( choose.set.by.worth 'curr.pop ) )
      ( bindq 'tries ( add1 'tries ) ) ) )
  ( if.true ( greater.than? 'tries 3 )
    (do
      ( bindq 'tries 0 )
      ( loop.until ( or? ( not? ( same? ( value.of 'father 'WORST ) ( value.of 'mother 'WORST ) ) )
                         ( greater.than? 'tries 3 ) )
        (do
          ( bindq 'mother ( choose.set.by.worth 'curr.pop ) )
          ( bindq 'tries ( add1 'tries ) ) ) ) ) )
  ( loop.until ( not? ( same? 'father 'mother ) )
    ( bindq 'mother ( choose.set.by.worth 'curr.pop ) ) )
  ( display> "Crossing " debug ) ( display 'father debug )
  ( display " and " debug ) ( display 'mother debug )
  ( bindq 'mother.rules ( copy.list ( value.of 'mother 'RULE.LIST ) ) )
  ( bindq 'father.rules ( copy.list ( value.of 'father 'RULE.LIST ) ) )
  ( bindq 'son ( new.set ) )
  ( bindq 'daughter ( new.set ) )
  ( bindq 'son.rules nil )
  ( bindq 'daughter.rules nil )
  ( bindq 'father.loc ( add1 ( rnd.mod ( sub1 ( length 'father.rules ) ) ) ) )
  ( bindq 'mother.loc ( add1 ( rnd.mod ( sub1 ( length 'mother.rules ) ) ) ) )
  ( bindq 'son.rules ( concat ( grab.first.n 'father.rules 'father.loc )
                              ( clip.list 'mother.rules 'mother.loc ) ) )
  ( bindq 'daughter.rules ( concat ( grab.first.n 'mother.rules 'mother.loc )
                                   ( clip.list 'father.rules 'father.loc ) ) )
  ( set.value 'son 'RULE.LIST 'son.rules )
  ( set.value 'daughter 'RULE.LIST 'daughter.rules )
  ( set.value 'son 'ACCURACY nil )
  ( set.value 'daughter 'ACCURACY nil )
  ( set.value 'son 'CREATOR 'CROSS.PAIR.SETS )
  ( set.value 'daughter 'CREATOR 'CROSS.PAIR.SETS )
  ( set.value 'son 'CREATED 'cycles )
  ( set.value 'daughter 'CREATED 'cycles )
  ( display " to form " debug ) ( display 'son debug )
  ( display " and " debug ) ( display 'daughter debug )
  ( return ( list 'son 'daughter ) ) )
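
The rule-list recombination at the heart of CROSS.PAIR.SETS uses a separate cut point in each parent, so the children can differ in length from their parents. The Python sketch below illustrates that step only (parent selection and frame bookkeeping are left out). It assumes `grab.first.n` returns the first n rules and `clip.list` returns what remains after the first n, neither of which the listing defines, and it requires each parent to hold at least two rules.

```python
import random


def cross_pair_sets(father_rules, mother_rules, rng=random):
    """Recombine two rule lists using an independent cut point in each.

    son      = head of father + tail of mother
    daughter = head of mother + tail of father
    """
    father_loc = 1 + rng.randrange(len(father_rules) - 1)
    mother_loc = 1 + rng.randrange(len(mother_rules) - 1)
    son = father_rules[:father_loc] + mother_rules[mother_loc:]
    daughter = mother_rules[:mother_loc] + father_rules[father_loc:]
    return son, daughter


if __name__ == "__main__":
    print(cross_pair_sets(["a", "b", "c"], ["x", "y"]))
```

Because every rule ends up in exactly one child, the two children together always hold as many rules as the two parents.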


( MUTATE.ONE.SET

description: Given the current population of rule sets, choose one and mutate a percentage of its rules, as
determined by the experiment frame
example input: ( con_101 con_202 con_303 ... )
example output: con_401
notes:
)

c: MUTATE.ONE.SET
sub.of     function
i.take     list
i.give     symbol
arguments  'curr.pop
my.vars    'mutatee 'mutation 'range 'percent '#rules 'the.rules 'chance 'min
algorithm  (do
  ( bindq 'mutatee ( choose.set.by.worth 'curr.pop ) )
  ( display> "Mutating " debug ) ( display 'mutatee debug )
  ( bindq 'the.rules ( copy.list ( value.of 'mutatee 'RULE.LIST ) ) )
  ( bindq 'range ( get 'BREED.INFO 'MUTATE.RANGE ) )
  ( bindq 'min ( value.of 'BREED.INFO 'MUTATE.MIN ) )
  ( bindq 'percent ( int ( rnd.normal ( first 'range ) ( second 'range ) ) ) )
  ( bindq '#rules ( quotient ( times ( length 'the.rules ) 'percent ) 100 ) )
  ( if.true ( less.than? '#rules 'min )
    ( bindq '#rules 'min ) )
  ( loop.until ( equal? '#rules 0 )
    (do
      ( bindq 'the.rules ( rnd.do
          ( cross.rules 'the.rules )
          ( mutate.one.rule 'the.rules )
          ( cons ( invent.rule ) 'the.rules )
          \ ( delete ( member 'the.rules ) 'the.rules )
          ) )
      ( bindq '#rules ( sub1 '#rules ) ) ) )
  ( bindq 'mutation ( new.set ) )
  ( set.value 'mutation 'RULE.LIST 'the.rules )
  ( set.value 'mutation 'ACCURACY nil )
  ( set.value 'mutation 'CREATOR 'MUTATE.ONE.SET )
  ( set.value 'mutation 'CREATED 'cycles )
  ( display " to form " debug ) ( display 'mutation debug )
  ( return 'mutation ) )


( REPRODUCE.ONE.SET

description: Given the current population of rule sets, choose one
example input: ( con_101 con_202 con_303 ... )
example output: con_101
)

c: REPRODUCE.ONE.SET
sub.of     function
i.take     list
i.give     symbol
arguments  'curr.pop
my.vars    'parent 'rules 'new.rules 'this.rule 'chance
algorithm  (do
  ( bindq 'parent ( choose.set.by.worth 'curr.pop ) )
  ( display> "Reproducing " debug ) ( display 'parent debug )
  \ ( bindq 'new.rules nil )
  \ ( bindq 'rules ( value.of 'parent 'RULE.LIST ) )
  \ ( bindq 'chance ( rnd.float ) )
  \ ( if.true ( less.than? 'chance 0.50 )
  \   ( bindq 'rules ( reverse 'rules ) ) )
  \ ( bindq 'spot ( rnd.mod ( length 'rules ) ) )
  \ ( bindq 'new.rules ( concat ( clip.list 'rules 'spot ) ( grab.first.n 'rules 'spot ) ) )
  \ ( loop.until ( null? 'rules )   \@@@@ scramble order of rules for more variety
  \   ( do                          in cross-breeding
  \     ( bindq 'this.rule ( member 'rules ) )
  \     ( bindq 'new.rules ( cons 'this.rule 'new.rules ) )
  \     ( delete 'this.rule 'rules ) ) )
  \ ( set.value 'parent 'RULE.LIST 'rules )
  ( return 'parent ) )


{  REPRODUCED  ESTEET 

description:  Create  a  copy  of  the  best  set  ever,  and  return  it 

example  input: 

example  output  cor.,5454 

notes:  x 


c:  REPRODUCEDESTEET 
sub.of  function 

i.take  none
i.give  symbol
my.vars  'parent 'rules 'new.rules 'this.rule
algorithm  (do
  ( bindq 'parent ( new.set ) )
  ( bindq 'rules ( value.of 'best.set.ever 'RULE.LIST ) )
  ( bindq 'new.rules nil )
  ( loop.until ( null? 'rules )
    (do
      ( bindq 'this.rule ( new.rule ) )
      ( copy.p.list ( first 'rules ) 'this.rule )
      ( bindq 'new.rules ( cons 'this.rule 'new.rules ) )
      ( bindq 'rules ( rest 'rules ) ) ) )
  ( set.value 'parent 'RULE.LIST 'new.rules )
  ( set.value 'parent 'ACCURACY ( value.of 'best.set.ever 'ACCURACY ) )
  ( set.value 'parent 'CREATOR ( value.of 'best.set.ever 'CREATOR ) )
  ( set.value 'parent 'CREATED ( value.of 'best.set.ever 'CREATED ) )
  ( set.value 'parent 'HELIX.MATT ( value.of 'best.set.ever 'HELIX.MATT ) )
  ( set.value 'parent 'SHEET.MATT ( value.of 'best.set.ever 'SHEET.MATT ) )
  ( set.value 'parent 'RAN.MATT ( value.of 'best.set.ever 'RAN.MATT ) )
  ( set.value 'parent 'TOTAL.MATT ( value.of 'best.set.ever 'TOTAL.MATT ) )
  ( set.value 'parent 'PERC.PRED ( value.of 'best.set.ever 'PERC.PRED ) )
  ( set.value 'parent 'INDIV.ACCURACY ( value.of 'best.set.ever 'INDIV.ACCURACY ) )
  ( set.value 'parent 'BEST ( value.of 'best.set.ever 'BEST ) )
  ( set.value 'parent 'WORST ( value.of 'best.set.ever 'WORST ) )
  ( return 'parent ) )
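
The copy step in the listing above can be sketched in Python. This is only an illustration under stated assumptions: rule sets and rules are modelled as plain dicts rather than frames, the function and constant names are hypothetical, and only a subset of the statistic slots is copied for brevity. Because the listing builds the new rule list with cons, the copied rules come out in reverse order, and the sketch mirrors that by prepending.

```python
# Hypothetical stand-in: a rule set is a dict of slots, each rule a dict of properties.
COPIED_SLOTS = ("ACCURACY", "CREATOR", "CREATED", "BEST", "WORST")  # subset, for brevity


def reproduce_best_set(best_set_ever):
    """Return a fresh rule set whose rules and statistics are copied from the record holder."""
    parent = {}
    new_rules = []
    for rule in best_set_ever["RULE.LIST"]:
        this_rule = dict(rule)          # new rule with a copied property list
        new_rules.insert(0, this_rule)  # prepend, as cons does in the listing
    parent["RULE.LIST"] = new_rules
    for slot in COPIED_SLOTS:
        parent[slot] = best_set_ever.get(slot)
    return parent


best = {"RULE.LIST": [{"name": "r1"}, {"name": "r2"}], "ACCURACY": 0.55, "CREATOR": "demo"}
clone = reproduce_best_set(best)
clone["RULE.LIST"][0]["name"] = "edited"        # does not touch the original rules
print([r["name"] for r in best["RULE.LIST"]])   # ['r1', 'r2']
```

Copying each rule, rather than sharing it, is what lets a resurrected set be mutated or cross-bred later without altering the stored best set.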


{ REPRODUCE.BEST.WORTH

description: Create a copy of the best worth set ever, and return it
example input:
example output: con_5454
notes: x


c:  REPRODUCE.BEST.WORTH
sub.of  function
i.take  none
i.give  symbol

my.vars  'parent 'rules 'new.rules 'this.rule
algorithm  (do
  ( bindq 'parent ( new.set ) )
  ( bindq 'rules ( value.of 'best.worth.set.ever 'RULE.LIST ) )
  ( bindq 'new.rules nil )
  ( loop.until ( null? 'rules )
    (do
      ( bindq 'this.rule ( new.rule ) )
      ( copy.p.list ( first 'rules ) 'this.rule )
      ( bindq 'new.rules ( cons 'this.rule 'new.rules ) )
      ( bindq 'rules ( rest 'rules ) ) ) )
  ( set.value 'parent 'RULE.LIST 'new.rules )
  ( set.value 'parent 'ACCURACY ( value.of 'best.worth.set.ever 'ACCURACY ) )
  ( set.value 'parent 'CREATOR ( value.of 'best.worth.set.ever 'CREATOR ) )
  ( set.value 'parent 'CREATED ( value.of 'best.worth.set.ever 'CREATED ) )
  ( set.value 'parent 'HELIX.MATT ( value.of 'best.worth.set.ever 'HELIX.MATT ) )
  ( set.value 'parent 'SHEET.MATT ( value.of 'best.worth.set.ever 'SHEET.MATT ) )
  ( set.value 'parent 'RAN.MATT ( value.of 'best.worth.set.ever 'RAN.MATT ) )
  ( set.value 'parent 'TOTAL.MATT ( value.of 'best.worth.set.ever 'TOTAL.MATT ) )
  ( set.value 'parent 'PERC.PRED ( value.of 'best.worth.set.ever 'PERC.PRED ) )
  ( set.value 'parent 'INDIV.ACCURACY ( value.of 'best.worth.set.ever 'INDIV.ACCURACY ) )
  ( set.value 'parent 'BEST ( value.of 'best.worth.set.ever 'BEST ) )
  ( set.value 'parent 'WORST ( value.of 'best.worth.set.ever 'WORST ) )
  ( return 'parent ) )


{ EVOLVE.POP

description: Given the current population, and the best score ever, evolve the population, resurrecting the
             best set ever ( accuracy and/or worth ) if we seem to be on the wrong track
example input: ( con_101 con_202 con_303 ... ) .55
example output: ( con_101 con_202 con_303 ... )
notes: what should the divine intervention criteria be?
       updates the dead set list


c:  EVOLVE.POP
sub.of  function
i.take  list number number
i.give  list
arguments  'curr.pop 'best.score 'best.worth
my.vars  'random 'crossover 'reproduce 'mutate 'new.set 'dead.sets 'new.pop 'dead.list 'help

algorithm  ( do
  ( bindq 'random ( value.of 'BREED.INFO '#RANDOM ) )
  ( bindq 'crossover ( value.of 'BREED.INFO '#CROSSOVER ) )
  ( bindq 'reproduce ( value.of 'BREED.INFO '#REPRODUCE ) )
  ( bindq 'mutate ( value.of 'BREED.INFO '#MUTATE ) )

  ( bindq 'help T )
  ( bindq 'new.pop nil )
  ( if.true ( less.than? ( quotient 'best.worth ( float 'best.worth.ever ) ) .85 )
    (do
      ( bindq 'new.pop ( cons ( reproduce.best.worth ) 'new.pop ) )
      ( bindq 'reproduce ( sub1 'reproduce ) )
      ( bindq 'help T ) ) )
  ( if.true ( and? ( less.than? ( quotient 'best.score 'best.ever ) .80 ) 'help )
    (do
      ( bindq 'new.pop ( cons ( reproduce.best.set ) 'new.pop ) )
      ( bindq 'reproduce ( sub1 'reproduce ) ) ) )
  ( loop.until ( equal? 'random 0 )
    (do
      ( bindq 'new.pop ( cons ( create.new.set ) 'new.pop ) )
      ( bindq 'random ( sub1 'random ) ) ) )

  ( loop.until ( equal? 'crossover 0 )
    (do
      ( bindq 'new.pop ( concat ( cross.pair.sets 'curr.pop ) 'new.pop ) )
      ( bindq 'crossover ( minus 'crossover 2 ) ) ) )

  ( loop.until ( equal? 'mutate 0 )
    (do
      ( bindq 'new.pop ( cons ( mutate.one.set 'curr.pop ) 'new.pop ) )
      ( bindq 'mutate ( sub1 'mutate ) ) ) )
  ( loop.until ( equal? 'reproduce 0 )
    (do
      ( bindq 'new.set ( reproduce.one.set 'curr.pop ) )
      ( bindq 'curr.pop ( removeall ( list 'new.set ) 'curr.pop ) )
      ( bindq 'new.pop ( cons 'new.set 'new.pop ) )
      ( bindq 'reproduce ( sub1 'reproduce ) ) ) )
  ( bindq 'dead.sets ( removeall ( cons 'best.worth.set.ever ( cons 'best.set.ever 'new.pop ) )
                                 ( value.of 'RULE.SET 'SUBS ) ) )
  ( bindq 'dead.list ( value.of 'KILL.INFO 'DEAD.SET.LIST ) )
  ( set.value 'KILL.INFO 'DEAD.SET.LIST ( listunion 'dead.list 'dead.sets ) )
  ( bindq 'dead.sets ( removeall 'dead.list 'dead.sets ) )
  ( loop.until ( null? 'dead.sets )
    (do
      ( killframe ( first 'dead.sets ) )
      ( bindq 'dead.sets ( rest 'dead.sets ) ) ) )
  ( return 'new.pop ) )
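
The control flow of the evolution step above can be sketched in Python. This is a rough illustration, not the report's implementation: the function name, the `record`, `quotas` and `ops` dictionaries, and all of the operator callables are hypothetical stand-ins for the breeding quotas and helper functions in the listing, and the dead-set bookkeeping is left out. The sketch assumes each cross-breeding call returns two sets, since the listing lowers the crossover counter by 2 per call. It also uses bounded `for` loops, whereas the listing counts down until a quota equals exactly 0.

```python
def evolve_pop(curr_pop, best_score, best_worth, record, quotas, ops):
    """Build the next population; every name here is a stand-in for the listing."""
    q = dict(quotas)       # copy so the caller's quotas are left untouched
    pool = list(curr_pop)  # reproduction removes chosen sets from this pool
    new_pop = []

    # Resurrect a record holder when the current best falls too far behind it.
    if best_worth / float(record["worth"]) < 0.85:
        new_pop.append(ops["best_worth"]())
        q["reproduce"] -= 1
    if best_score / float(record["score"]) < 0.80:
        new_pop.append(ops["best_set"]())
        q["reproduce"] -= 1

    for _ in range(max(q["random"], 0)):
        new_pop.append(ops["create"]())
    for _ in range(0, max(q["crossover"], 0), 2):  # counter drops by 2 per call
        new_pop.extend(ops["cross"](curr_pop))
    for _ in range(max(q["mutate"], 0)):
        new_pop.append(ops["mutate"](curr_pop))
    for _ in range(max(q["reproduce"], 0)):
        chosen = ops["select"](pool)
        pool.remove(chosen)                        # carried over at most once
        new_pop.append(chosen)
    return new_pop


ops = {
    "best_worth": lambda: "best_worth_copy",
    "best_set": lambda: "best_set_copy",
    "create": lambda: "random_set",
    "cross": lambda pop: ["child_a", "child_b"],
    "mutate": lambda pop: "mutant",
    "select": lambda pool: pool[0],
}
quotas = {"random": 1, "crossover": 2, "mutate": 1, "reproduce": 2}
record = {"score": 0.70, "worth": 10.0}
demo = evolve_pop(["con_101", "con_202", "con_303"], 0.55, 9.0, record, quotas, ops)
print(len(demo))  # 6, the sum of the four quotas
```

In this run the accuracy ratio (0.55 / 0.70) is below 0.80, so one of the two reproduction slots is spent on a copy of the best set, while the worth ratio (0.9) is above 0.85 and triggers nothing. The population size therefore stays equal to the sum of the quotas.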